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PREFACE TO THE REVISED EDITION 


During the fourteen years that have elapsed since the first edi- 
tion of this book was published there has been a very considerable 
extension of the use of statistical methods in business, in public 
administration, and in all the social sciences. The pressing require- 
ments of new tasks and new problems, together with increasing 
knowledge of statistical procedures on the part of administrative 
and research workers, have contributed to this extension. With 
this development, the older controversies over qualitative versus 
quantitative methods have largely been shelved. It is clear that 
different problems call for different procedures; that the men who 
are grappling with research problems differ, as regards the methods 
of analysis they find congenial and fruitful; that induction and 
deduction are complementary phases of the processes that lead to 
scientific advance. The choice of research procedures does not 
necessitate the acceptance of one method and the rejection of an- 
other; it calls for the finding of a blend of methods that is adapted 
to a particular set of problems, and that is suited to the tempera- 
ment and abilities of the human agent that employs them.- For 
workers dealing with social and economic relations, statistical 
methods constitute an essential element of this blend. Knowledge 
of systematic procedures for handling quantitative data, and skill 
in their use, are necessary parts of the equipment of students of the 
social sciences and of public and private administrators who must 
utilize the facts of experience in the formulation of policies. 

Gains on this front have been paralleled by notable improve- 
ments in statistical techniques. The post-war years have witnessed, 
in this field, the initiation of such another period of intellectual 
ferment and creative activity as that which, earlier, brought the 
great contributions of Karl Pearson and his associates. The older 
instruments of quantitative analysis have been refined and sharp- 
ened; methods of designing statistical experiments and formulat- 
ing and testing hypotheses have been improved; statistical infer- 
ence has been placed on a sounder foundation. There can be no 
doubt that these continuing improvements in the logic and in the 
technique of statistics will contribute in important ways to the 
advance of the social sciences and to the betterment of public and 
private administration. 
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viii PREFACE TO REVISED EDITION 

In preparing the present edition of Statistical Methods account 
has been taken of the more important of the recent developments 
that have a bearing on the economic and business applications of 
statistics. In doing this I have sought to retain the main features 
of the first edition. A systematic development of the fundamentals 
of statistical method is needl'd by the beginning student. A work- 
ing compendium of procedures, with necessary aids to calculation 
and reference tables, is required by the statistician engaged in ad- 
ministration or research. The book is designed to meet these two 
needs. 

The eighteen chapters of the present edition fall into two main 
divisions. The first twelve chapters deal with the descriptive* as- 
pects of statistics. Induction and sampling are purposely omitted 
in this development of basic descriptive procedures. Problems of 
statistical inference, with certain more advanced aspects of statis- 
tical description, are discussed in the last six chapters, and in 
appendices A to E. This organization is, I think, well adapted to 
the needs of instruction. Some teachers may, indeed, prefer to 
introduce at an earlier point the concepts of samples and parent 
populations and the treatment of sampling errors. If so, selected 
pages from the chapter on elementary probabilities and the normal 
curve (Chapter XIII) and from the introductory chapter on induc- 
tion (Chapter XIV) may follow Chapter V in the sequence of 
study. 

In the chapters added to this edition I have sought to exemplify 
economic applications of flic* newer methods of analysis. These 
methods offer rich and, as yet, largely unexplored possibilities to 
research workers in the social sciences. In these sections I have 
drawn heavily on the path-breaking work of R. A. Fisher. I am 
indebted to Dr. Fisher and his publishers, Oliver and Boyd of 
Edinburgh, for permission to include in this book the tabulations 
that appear in certain of the Appendix Tables. Those, with tin.* 
other tables included, arc* designed to make the present book a 
reasonably complete working manual adapted to the needs of both 
laboratory worker and student. 

I must reaffirm my thanks to those who assisted me in various 
ways in the preparation of the first edition. I am indebted, in 
addition, to Jacob M. Could, Agnes B. Omundson, and William II. 
Mills for valuable aid in the details of the revision. 

F. C. M. 

May, 1938. 



PREFACE TO THE FIRST EDITION 


The last decade has witnessed a remarkable stimulation of 
interest in quantitative methods in business and in the social 
sciences. The day when intuition was the chief basis of business 
judgment and unsupported hypothesis the mode in social studies 
seems to have passed. Following the lead of workers in the older 
and traditionally more accurate physical sciences, social scientists 
and serious students of business are employing in greater measure 
than ever before a method of study based upon the observation 
and analysis of facts. When these observations are quantitative 
in character appropriate methods are necessary for their organiza- 
tion and interpretation. This book deals with methods of com- 
bining and analyzing such observations, with primary emphasis 
upon materials drawn from the fields of economics and business. 

The justification for limiting the treatment to these particular 
fields is two-fold. Although general statistical methods are prac- 
tically universal in their application, special problems are en- 
countered in every field of study. This is particularly true in the 
realm of economies, which presents many distinctive difficulties 
and many characteristic problems. Methods that are in some 
degree specialized to meet these particular requirements have 
been developed, and these methods call for treatment in a work 
that is restricted in scope. In the second place, methods can 
be most effect ively explained in terms of particular subjects; ab- 
stract methodology is barren of interest to the average person. 
For these reasons the book has been written with reference to the 
specific, needs of quantitative workers in economics and business. 

In the explanation of methods no attempt has been made to 
secure the brevity of exposition which may be desirable in a 
strictly mathematical work. The purpose throughout has been 
to write for the learner not for the finished master, and the expla- 
nations have been prepared with the needs of the former in mind. 
I have felt free* to omit certain detailed demonstrations of theorems 
because this book is presented as an introduction to the subject, 
not as an exhaustive treatise. 

The methods of quantitative analysis that are in general use 
today represent a long accretion, an accumulation of contribu- 
tions from workers in many fields. It would be vain to attempt to 
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X PREFACE TO THE FIRST EDITION 

enumerate all the individuals who have contributed to the develop- 
ment of the science of statistics. Individual references are given 
in particular cases in the body of the text, but no list of such ac- 
knowledgments can serve as a complete record of the debt modern 
statisticians owe to their predecessors. 

For assistance in the preparation of the material contained in 
this book I am under many obligations. To Mr. H. E. Anderson 
and Professor H. B. Killough I am indebted for certain of the data 
employed in Chapters XI, XVI, and XVII. Professor Warren M. 
Persons of the Harvard Committee on Economic Research has 
courteously permitted me to make use of certain results of his work 
on commodity price index numbers. The index of industrial activ- 
ity presented in Chapter IX and utilized in Chapter XI is a product 
of the Statistical Division of the American Telephone and Tele- 
graph Company. I have employed it with the kind permission of 
Mr. Seymour L. Andrew, Chief Statistician. Suggestions from 
Professor A. H. Mowbray of the University of California have en- 
abled me to remove several obscurities that were present in an 
earlier mimeographed edition. I am deeply grateful to Professors 
Henry L. Moore, Theodore II. Brown, and Henry Schultz for their 
help in critically reviewing portions of the manuscript. For assist- 
ance at every stage of the work involved in the writing of this book 
I am under deep obligation to Professor Donald H. Davenport. 
His aid in the collection of material, in the preparation of charts, 
and in the onerous task of seeing the book through the press has 
been invaluable. To my wife, above all others, I am indebted 
for a measure of constant and generous help that cannot be ad(*- 
quately acknowledged here. 

F. C. M. 

November, 1924. 
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CHAPTER I 


STATISTICAL METHODS AND THE PROBLEMS 
OF ECONOMICS AND BUSINESS 

The distinction between economics and business rests upon 
viewpoint and approach, rather than subject matter. The 
economist and the business man have different objectives, 
but the substance of the science of economics and the mate- 
rials with which the art of business administration deals 
are in large part the same. In this treatise we are con- 
cerned with methods that may be employed in handling this 
common subject matter. 

Classes of Business Activity 

The tasks that confront business men may, without undue 
straining, be placed in three classes. First, in logical se- 
quence, are the technical tasks that arise in the processes 
of production, involving problems of chemistry and physics, 
of engineering, of animal husbandry, of navigation. The 
basic technical knowledge called for in the solution of these 
problems furnishes the foundation of our economic life. 
This is the domain of the hard-won arts of handling the 
raw materials and controlling the forces of nature. 

In the second class come activities that are connected 
with the internal organization and administration of indi- 
vidual business units. The technical functions of manipulat- 
ing organic and inorganic matter for the satisfaction of 
human wants are performed through administrative units, 
single farms, mines, factories, railroads, department stores. 
A whole new division of problems is faced by the business 
man in organizing these units, in coordinating the work of 
the different departments, in supervising the daily activities 
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of the individuals making up each organization. While these 
are perhaps less fundamental than the technical problems of 
production, they are, for the average business man, more 
pressing and more difficult. Scientific method has made 
less progress in solving these latter problems. There is not 
the organized body of knowledge which is found in the 
former field, nor are there the same trained experts to whom 
the tasks may be delegated. 

The two types of economic activity named above include 
tasks that are in a sense self-centered and controllable. The 
manufacturer of steel has his technical problems of smelt- 
ing and refining, his particular administrative duties. The 
farmer or mine-owner faces the same types of problems, in 
forms peculiar to his own situation. In the performance of 
tasks in these fields each man is dealing with problems all 
the elements of which are under more or less perfect control. 
Difficulties arise, but these are ordinarily difficulties inherent 
in the given task, not difficulties arising from a sudden change 
in the constituent elements of the problem, or the sudden 
interjection of a new factor. In this respect the third cate- 
gory of tasks to be performed by the business man differs 
materially from the first two. For this class is composed of 
problems the elements of which are subject only in part 
to control by the individuals directly concerned. 

This third division includes buying and selling, and all 
the attendant activities that are carried on in terms of 
prices. As economic life is at present organized these func- 
tions are, to the business man, the most important ones he 
performs. The technical tasks of production and of internal 
organization and administration are but means to an end. 
For the business man the goal of economic activity is the 
disposal of his product at a profit. The tasks preliminary to 
this final sale are of necessity subordinated to it, and so 
performed that the final aim may be achieved. The point 
of emphasis here is that the business man, in buying and 
selling, faces problems containing elements which he cannot 
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control. In securing his raw material, in bringing together 
the other agents needed in production, and in the final dis- 
posal of his product, the business man deals with markets — 
commodity markets, labor markets, money markets — and 
finds himself acting in relation to a system of prices quite 
beyond his control in its major movements. The other 
less fundamental phases of his activity are subject to a high 
degree of control, but when the business man comes to the 
final and most important act, the profitable sale of his prod- 
uct, his power of control dwindles. The motivating force in 
business activity is the hope of pecuniary profits, pecuniary 
profits depend upon successful buying and selling, successful 
buying and selling depend upon favorable conditions in an 
uncontrollable world of prices — here is the argument that 
states the major problem of business. And these are the 
facts which make the price system the dominating and 
all-important factor in modern business life. 

The modern entrepreneur lives in an environment of 
prices. The term “environment” is not an unapt figure; 
this world of prices in which the business man functions 
constitutes a coherent, consistent, well-articulated system 
of interdependent parts, a system which encompasses all 
the business activities of the entrepreneur. Since the system 
is beyond the control of the individual he must adapt him- 
self to it, and must base his activities upon as complete an 
understanding of the system as he may obtain. Without 
this understanding the major problems of business are in- 
capable of solution. 

Quantitative Character ok Economic and Business 

Problems 

Problems falling in the first of the classes outlined above 
have long been recognized as essentially quantitative in 
character. Their solution calls for the application of the 
methods of precision which have been developed in the 
physical sciences. It is no less true that the strictly eco- 
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nomic and business problems falling in the other classes 
require the employment of quantitative methods. Quali- 
tative considerations enter, of course, in the solution of such 
problems, helping to determine the questions to be asked 
and the methods to be employed. But facts, measured, 
weighed and compared with other facts, constitute the basis 
of business judgments and the foundation of economic rea- 
soning. Statistical methods provide means of organizing 
and appraising these facts. 

Of the three classes of problems distinguished in the pre- 
ceding section two come within the scope of the present dis- 
cussion. Though the methods of statistics are in part ap- 
plicable to the solution of technical problems of production, 
it is not the purpose of the present work to develop this 
subject. For the solution of problems in the two other 
fields — those connected with tire internal organization and 
administration of business units and with the processes of 
buying and selling that bring the business man into contact 
with the price system — methods of statistical analysis are 
peculiarly appropriate. 

Statistical Methods and Problems of Internal 
Administration 

The typical business man, in the administration of his 
organization, is called upon to deal with masses of measure- 
ments. He is dealing with tons of coal, cubic feet of gas, 
or kilowatt hours of energy consumed; with tons of pig iron 
or pairs of shoes produced; with machine hours and man 
hours; with wages, costs of production and selling prices 
expressed in dollars and cents. With the increasing size of 
the business unit the data with which the administrator 
must deal become increasingly complicated and numerous, 
and it becomes increasingly difficult to determine their true 
significance. Under intuitive or rule-of-thumb methods of 
administration it is impossible effectively to analyze large 
masses of data and to control business units above the 
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average in size. It has been abundantly demonstrated that 
the law of decreasing returns comes into play in business 
largely because of administrative difficulties. 

Whenever one deals with masses of data the problem is 
one of condensation and analysis — condensation and sim- 
plification in order that it may be possible for limited human 
faculties to handle the data, analysis (and comparison) in 
order that the elements of the problem may be distinguished 
and their significance appreciated. Statistical methods have 
been developed to facilitate the condensation and analysis 
of masses of quantitative data. 

As a typical example of such a problem may be mentioned 
the allocation of costs, an operation which has been called 
cost accounting. The proper analysis of all the factors 
which enter into this problem is only possible through the 
use of statistical methods. Accounting methods, restricted 
to the treatment of pecuniary units, are inadequate for the 
complete analysis of the items of expense. The analysis of 
sales records, again, calls for the condensation of masses of 
data, their representation in simple, understandable form, 
and their interpretation in relation to other business meas- 
urements. The analysis of markets and the study of purchas- 
ing records and commodities require the use of quantitative 
methods not restricted in their application to any one class 
of measurements. At every hand in internal administration 
statistical methods may be used to supplement accounting 
methods, to extend the knowledge of the executive, and to 
make more effective the control of business operations. 

Statistical Methods and External Problems 

New problems are encountered when the business man 
goes into the market to buy or sell. Continually before him 
are the phenomena of business cycles, and if he is to adapt 
his producing and marketing policies to the swings of the 
cycle he must undertake the analysis of these phenomena, 
employing tools appropriate to the task. Again, the price 
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system, the movements of which are of such fundamental 
interest to the business man, requires analysis through the 
use of quantitative methods. So complex and numerous are 
the data to be dealt with here that simplification is impera- 
tive. Apart somewhat from the immediate interests of the 
business man, but of dominant importance to the economist, 
are all the problems connected with the economic process of 
distribution, the allocation of income and wealth among 
the agents of production. These, as well as that other great 
economic problem concerned with the question of value or 
price determination, arc quantitative problems, to be solved 
through the use of quantitative methods of research. 

Statistical Procedures in Research 

What are these methods, and wherein does research em- 
ploying such methods differ from other types of research? 
Scientific inquiry, whatever its particular method may be, 
proceeds through careful observation, logical inference and 
accurate verification. Quantitative methods differ from 
others only in that observation, inference, and verification 
are based upon measurement. Until measurement is possible 
in a science it is unavoidable that its observations and find- 
ings should lack precision, no matter how brilliant the flashes 
of intuition nor how painstaking the labors of its students 
may be. The employment of methods of measurement, 
making possible the analysis of the factors involved in terms 
of precise units, gives to a science some of the advantages 
that sharp-edged tools have over blunt and unreliable instru- 
ments. Mathematics and its offspring, statistics and ac- 
counting, are the powerful instruments which the modern 
economist has at his disposal, and of which business, through 
the development of research agencies and methods, is making 
constantly greater use. 

The tools of the statistician are merely certain mathe- 
matical methods, developed for particular types of research. 
These types of research were not economic in the original 
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development of statistical methods, but social, political, and 
anthropometric, with one line of development (that relating 
to the theory of probabilities) extending back through the 
field of logic; to the gaming table. Yet these tools, developed 
for work in restricted spheres, have been found to possess 
much wider applicability, and economics has been one of 
the newer fields in which the application of these methods, 
with appropriate alterations and additions, has had fruitful 
results. The economist has found his hand strengthened 
and the precision of his work materially increased by the 
new tools. And business, together with the more abstruse 
science of economics, has profited. 

Reference has been made to the possibility of condensa- 
tion and simplification through the use of statistical pro- 
cedures. Such simplification is of cardinal importance in 
economics and in the other social sciences today. These 
sciences, to be realistic, must be scrupulously faithful to 
fact, yet the masses of facts relating to current social proc- 
esses are, in their magnitude, almost a menace to effective 
analysis. “Already,” writes a reviewer in the Journal of the 
Royal Statistical Society, “economic analysis taxes language 
to its utmost, and it is a question how much longer mere 
verbal exposition will be able to control the swelling floods 
of observable data.” Though one may feel that these floods 
of data fail to provide many of the essential facts about 
social processes today, there is point to the reviewer’s com- 
plaint. In the light of this danger systematic procedures 
in the organization and analysis of data have an importance 
today that t hey did not have at an earlier time. Statistical 
methods constitute such procedures. By their use we may 
seek to channel and appraise the floods of data, relating to 
business operations and other social processes, that the 
fact-gathering agencies of business and government currently 
release upon us. 
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GRAPHIC PRESENTATION 

The explanation of methods of condensing, analyzing, and 
interpreting the facts of business and economics must start 
with the discussion of some fundamental considerations 
which are mathematical rather than statistical in character. 
In doing so it is deemed advisable, even at the risk of tread- 
ing quite familiar ground, to explain certain mathematical 
conceptions to which constant reference will be made in 
later chapters. 

Statistical analysis is concerned primarily with data based 
upon measurement, expressed either in pecuniary or physical 
units. The methods of coordinate geometry, developed first 
by the philosopher Descartes, greatly facilitate the manipu- 
lation and interpretation of such data. A summary of the 
basic principles of coordinate geometry will not be out of 
place. 


Rectangular Coordinates 

If two straight lines intersecting each other at right 
angles are drawn in a plane, it is possible to describe the 
location of any point in that plane with reference to the 
point of intersection of the two lines. We will call one of 
the lines (a vertical line) Y'Y, the other line (horizontal) X'X, 
and the point of intersection (or origin) 0 (cf. Fig. 1). If P 
be any point in the plane, we may draw the line PM, par- 
allel to Y'Y and intersecting X'X at M, and the line PN, 
parallel to X'X and intersecting Y'Y at N. If we set OM 
equal to g units and ON equal to h units, g and h constitute 
the coordinates of P, describing its location with reference 
to the origin 0. Thus, in Fig. 1, g equals 6 and h equals 5. 

s 
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The distance g along the x-axis is termed the abscissa of 
the point P, while the distance h along the y - axis is termed 
the ordinate of the point P. (It is a rule of notation always 
to give the abscissa first, followed by the ordinate.) The 
coordinates of any other point in the same plane may be 



Fig. 1 . — Location of a Point with Reference to Rectangular Coordinates 


determined in the same way. Conversely, any two real 
numbers determine a point in the plane, if one be taken as 
the abscissa and the other as the ordinate. 

A point may lie either to the right or left or above or 
below the origin, 0. It is conventional to designate as 
positive abscissas laid off to the right of the origin, and as 
negative abscissas laid off to the left of the origin, while 
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ordinates are positive when laid off above the origin and 
negative when laid off below the origin. In general, the 
values to be dealt with in economic statistics lie in the 
upper right-hand quadrant, where both abscissa and ordinate 
are positive. 

This conception of coordinates is fundamental in mathe- 
matics and of basic importance in statistical work. A very 
simple example will illustrate the utility of this device in 
representing business data. The figures presented in the 
following table may be employed. 

Table 1 

Production of Passenger Automobiles in the United States, by 
Months, During the Year 1.9,i7 



Number of 

Month 

pause tujer ears 
manufactured 

January 

309,037 

February 

290,030 

March 

403, S7!) 

April 

439,980 

May 

425,432 

June 

411,394 

July 

300,403 

August 

311,450 

September 

US. 071 

October 

298,002 

November 

295,328 

December 

244,385 


These data may be represented graphically on the co- 
ordinate system, months being laid off along the ./-axis and 
number of automobiles along the //-axis, as in the accom- 
panying diagram (Fig. 2 ). In plotting t I k; abscissas, Decem- 
ber, 1930, is considered as located at the point of origin. 
The .r-value of the entry for January, 1937, is thus 1, ot 
the February figure 2, etc. The coordinates of the point 
representing the number of cars produced in .January, 1937, 
are 1 and 309,037 ; for February the values are 2 and 290,030. 
The coordinates for December are 12 and 244,383. The 
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rhousands 
of cars 



Fig. 2. — Production of Passenger Automobiles, by Months, During the 

Year 1937 


movement of automobile production during the year may 
be more easily followed if the points are connected by a 
series of straight lines, as is done in the figure. 

Independent and Dependent Variables 
In the location of any point by means of coordinates it 
has been pointed out that two values are involved; every 
point tics together and expresses a relation between two 
lactors. In the above case these are months and number of 
passenger automobiles produced. With the passage of time 
the volume of automobile production changes, and the 
broken line shows the direction and magnitude of these 
changes. Both time and number of cars produced arc vari- 
ables, that is, they are quantities not of constant value but 
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characterized by variations in value in the given discussion. 
Thus in Fig. 1 the abscissa has a fixed value of 6, while the 
ordinate has a fixed value of 5, but in Fig. 2 both abscissa 
and ordinate have varying values, the one varying from 1 
to 12, the other from 118,671 to 439,980. The symbols x 
and y are, by convention, used to designate such variable 
quantities as these, the former in all cases representing the 
variable plotted along the horizontal axis, the latter rep- 
resenting the variable plotted along the vertical axis. 1 

In Fig. 2, which depicts the changes taking place in 
automobile production with the passage of time, it will be 
noted that the latter variable changes by an arbitrary unit, 
one month. Having made an independent change in the 
time factor we then determine the change in output taking 
place during the period thus arbitrarily chopped out. The 
variable which increases or decreases by increments arbi- 
trarily determined is called the independent variable, and is 
generally plotted on the .r-axis. The other variable is termed 
the dependent variable, and is plotted on the //-axis. This 
dependence may be real, in the sense that the values of the 
second variable arc definitely determined by the values of 
the independent variable, or it may be purely a conven- 
tional dependence of the type described. Time, it should 
be noted, is always plotted as independent, when it consti- 
tutes one of the variables. 

Function al Re lationship 

When the relationship between two variables is one of 
complete dependence, so that the value of y is uniquely 
determined by a given value of x, y is said to be a function 
of x. The general expression for such a relationship is 
y = f(x). Thus the speed at which a body is falling at a 

1 It should bo noted that letters at the end of the alphabet are used as 
symbols for variables , while letters at the beginning of the alphabet an* used as 
symbols for constants , i.e., quantities the values of which do not change in the 
given discussion. 
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given moment is a function of the time it has been falling, 
the pressure of a given volume of gas is a function of its 
temperature, the increase of a given principal sum of money 
at a fixed rate of interest is a function of time. If the values 
of the independent variable be laid off on the x-axis of a 
rectilinear chart and the corresponding values of the func- 
tion (i.e., the dependent variable) be laid off on the //-axis, 
a graphic representation of the function will be secured, in 
the form of a curve. 1 This concept of functional relationship 
is a very important one in statistical work. Some of the 
simpler functions may be briefly discussed. 

THE STRAIGHT LINE 

If two variables are so related that their values are always 
the same, their relationship is obviously of the form y = x. 
As a very simple example, the relation between the age of a 
tree and the number of rings in its trunk may be cited. 
A tree 0 years old will have G lings, for 20 years there will 
be 20 rings, and so on. This relationship may be represented 
on a coordinate chart, several sample values of x and y 
being taken. When these points are plotted and a line drawn 
through them, we secure a straight line passing through the 
origin and (assuming the two scales to be equal) bisecting 
the right angle XOY (cf. Fig. 3). 

Similarly, any equation of the first degree (i.e., not in- 
volving xy, or powers of x or y other than the first) may be 
represented by a straight line. The generalized equation can 
be reduced to the form y = a + bx, where a is a constant 
representing the distance from the origin to the point of 
intersection of the given line and the //-axis, and b is a con- 
stant representing the slope of the given line (that is, the 
tangent of the angle which the line makes with the hori- 
zontal). The constant term a is called the y-intercept. It is 
clear from the generalized equation of the straight line that 

1 The general term “curve” is used to designate any line, straight or curved, 
when located with reference to a coordinate system. 
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when x has a value of zero, y will be equal to this constant 
term. In the example given above (Fig. 3) a is equal to 0, 
and b to 1. The location of a given line depends upon the 
signs of a and b as well as upon their magnitudes. The prac- 
tical problem involved in the determination of any straight 



line is that of finding the values of a and b from the data, 
a problem which will appear in various forms in the discus- 
sion of statistical methods. 

These points may be illustrated by the plotting of a 
simple equation of the first degree. Thus, to construct 
the graph of the function, y = 2 + 3x, various values of 
x are assumed, and corresponding values of y are deter- 
mined. These may be arranged in the form of a table: 
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X 

y 


(2 + 3x) 

- 4 

- 10 

- 2 

- 4 

0 

2 

2 

8 

4 

14 

Plotting these values and connecting the plotted points, 

the graph illustrated in 

Fig. 4 is secured. It will be noted 

a 

Y 

/ 



-2 -1 0, 1 2 3 4 

Y' 


Fig. 4. — Graph of the Equation y = 2 + Zx 

that since this function is linear (that is, the graph takes 
the form of a straight line) any two of the points would have 
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been sufficient to locate the line. The ^-intercept is equal 
to the constant term 2, and the tangent of the angle which 
the given line makes with the horizontal (the slope of the line) 
is equal to 3, the coefficient of x. That this curve repre- 
sents the equation is proved by the fact that the equation is 
satisfied by the coordinates of every point on the curve, and 
that every pair of values satisfying the equation is represented 
by a point on the curve. It is characteristic of a linear 
relationship that if one variable be increased by a constant 
amount, the corresponding increment of the other variable 
will be constant. In the above case as x grows by constant 
increments of 2, for example, the constant increment of the 
j/- variable is 6. Series which increase in this way by con- 
stant increments are termed arithmetic series. 

Many examples of linear relationship between variables 
are found in the physical sciences. An example from the 
economic world is found in the growth of money at simple 
interest, that is, interest which is not compounded. If we 
let r represent the rate of simple interest, x the number of 
years, and y the sum to which one dollar will amount at 
the end of x years, the equation of relationship is of the 
form 

y = 1 + rx. 

Since in a given case r will be constant, this is of the simple 
linear type. In statistical work precise relationships of 
this type rarely if ever occur, but approximations to the 
straight line relationship are found constantly. 

NON-LINE A It RELATIONSHIP 

Non-linear functions are of many types, of which only 
a few of the more common will be discussed here. The 
student should be familiar with the general characteristics 
of the chief non-periodic curves, of which the parabolic 
and hyperbolic types, on the one hand, and the exponential 
type on the other, are the most important. The potential 
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series is mentioned as a more general form of rather wide 
utility. Of periodic functions the sine curve is briefly de- 
scribed, as a fundamental form. 

Functional relationships of the parabolic or hyperbolic 
form are quite common in the physical sciences, and such 
curves are found to fit certain classes of economic data. 
The general equation, when there is no constant term, is 
of the form y = ax b . The curve is parabolic when the ex- 
ponent b is positive, and hyperbolic when b is negative. 
The two following examples will serve to illustrate these 
types: 

Problem: To construct the graph of the function y = x 2 . 


X 

y 


O' 2 ) 

- 5 

25 

- 4 

16 

- 3 

9 

- 2 

4 

- 1 

1 

0 

0 

1 

1 

2 

4 

3 

9 

4 

16 

5 

25 


The graph is shown in Fig. 5. 

Problem: To construct the graph of the function y = x~\ 
for positive values of x. 


x y 

(ar 1 ) 

£ 3 

1 2 

1 1 

2 £ 

3 £ 

4 i 

5 £ 
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Fig. 5. — Parabola: Graph of the Equation y = x 2 

The graph of this function, an equilateral hyperbola, is 
shown in Fig. 6. It should be noted that this equation 

may also be written y = \ or xy = 1. 

Ju 

It is characteristic of relationships of this type that as x in- 
creases in geometric progression, y also increases in geometric 
progression. Thus, in the example of the parabola given above 
(y = x-), if we select the x values which form a geometric 
series, 1 the corresponding y values form a similar series: 

a; 1 2 4 8 16 32 

y 1 4 16 64 256 1,024 

Another class of functions is of the form represented by 
the equation y = ab x . In equations of this type one of the 
variable quantities occurs as an exponent; graphs repre- 

1 A geometric scries is one each term of which is derived from the preceding 
term bv the application of a constant multiplier. 
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Fig. 6. — Equilateral Hyperbola: Graph of the Equation y = x~ l 
(for positive values of x) 


seating such equations are called exponential curves. The 
example which follows illustrates the type. 

Problem: To construct the graph of the function y = 2 X , 
for positive values of x. 

x v 

(2*) 

0 1 

1 2 

2 4 

3 8 

4 16 

5 32 

6 64 
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This graph is shown in Fig. 7. 

It has been noted that the relationship between two 
variables which increase by constant increments (consti- 
tuting arithmetic series) may be represented by a straight 
line, and that the relationship between variables increasing 



Fio. 7. — Exponential Curve: Graph of the Equation y 2* (for posi- 
tive values of x) 

in geometric progression may be represented by either a 
parabola or a hyperbola. The exponential curve constitutes 
a hybrid type. It describes a relation in which one variable 
increases in arithmetic progression while the other increases 
in geometric progression. The figures given above illustrate 
this relationship. 
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Curves based upon relationships of the following type have 
been employed extensively in statistical inquiries : 

y=a+bx+cx 2 +dx 2 + .... 

The term potential series has been applied to equations of 
this type. Though such curves do not constitute parabolas 
of the strict conic section type, a curve based upon such 
an equation carried to the second power of x is termed a 
second degree parabola, to the third power of x, a third 
degree parabola, etc. No uniform and simple type is secured 
from this series. It is treated in more detail at a later point. 

Periodic functions constitute another distinct type, a class 
represented notably by electrical and meteorological rela- 
tions, though not confined to these fields. The character- 
istic feature of such relationships is that values of the de- 
pendent variable repeat themselves at constant intervals 
of the independent variable. The sine curve, the basic type 
of this class, is illustrated in the following example. 

Problem: To construct the graph of the function y = sin x. 


X 

y 

(angle in degrees) 

(sin x) 

0° 

.000 

30° 

.500 

G0° 

.866 

5)0° 

1.000 

120° 

.866 

150° 

.500 

180° 

.000 

210° 

- .500 

240° 

- .866 

270° 

- 1.000 

300° 

- .866 

330° 

- .500 

360° 

.000 

390° 

.500 


etc. 


The graph is shown in Fig. 8. 
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The full importance in statistical work of securing a 
mathematical expression for the relation between two vari- 
ables cannot be demonstrated until the subject has been 
further developed. One fundamental object is the deter- 
mination of physical or economic laws underlying observed 
phenomena. Another more practical object is the securing 



of a formula by means of which values of one variable may 
be approximated from given values of the other. Examples 
throughout the book will serve to illustrate how these objects 
are attained . 1 


Logarithms 

Logarithms, which play such an important part in gen- 
eral mathematical operations, are of equal importance in 
the manipulation of the raw materials of statistics. The 
nature of logarithms, and the methods by which they are 
employed to facilitate arithmetic processes, may be briefly 

1 A toiler discussion of different curve types is presented below, in the section 
dealing with the analysis of time series. 



LOGARITHMS 


23 


reviewed. This discussion is concerned only with the common 
system of logarithms of which the base is 10. 

Any positive number may be expressed as a power of 10. 
Thus 

1,000 = 10 X 10 X 10 = 10 3 
10,000 = 10 X 10 X 10 X 10 = 10 4 . 

In each case the exponent of 10 (the small number written 
above and to the right) indicates the number of times the 
figure 10 is repeated as a factor. For the integral powers 
of 10 the exponent is a whole number, but for the other 
numbers the exponent will contain a fractional value. Thus 
100 is equal to 10 raised to the power 2, or 10 2 ; 110 is equal 
to 10 raised to the power 2 . 04139, or 10 204139 . 

The exponent of 10, or the index of the power to which 10 
must be raised to equal a certain number, is called the loga- 
rithm of that number. The logarithm of 100 is 2, the logarithm 
of 110 is 2.04139, the logarithm of 998 is 2.99913. These 
figures all have reference to the base 10, though a system 
of logarithms might be developed on any base. In general, if 

a — b c 

logs a = c 

which may be read "the logarithm of a to the base b is 
equal to c.” The relation between the given number, the 
base and the logarithm, when the common system of loga- 
rithms is employed, may be easily remembered if the follow- 
ing relations are kept in mind : 

100 = 10 2 
logio 100 = 2. 

The logarithm of any number has two parts, the integral 
and the decimal. The whole number is called the charac- 
teristic, and the decimal portion is termed the mantissa. 
The former is determined in a given case by inspection, 
while the mantissa may be obtained from logarithmic tables. 
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The characteristic varies with the location of the decimal 
point, while the mantissa remains the same for any given 
combination of numbers. This fact is illustrated by the 
following figures: 

log of 8,450 = 3.92686 

log of 845 = 2.92686 

log of 84.5 = 1.92686 

log of 8.45 = 0 . 92686 

log of 0.845 = 9.92686 - 10 

log of 0.0845 = 8.92686 - 10 . 

In finding the natural number to which a given logarithm 
corresponds (such natural numbers are termed anti-loga- 
rithms), the mantissa determines the sequence of figures, 
while the whole number, or characteristic, determines the 
location of the decimal point. For example, in seeking the 
anti-logarithm of 2.17(500 it is found that the decimal 
.17609 follows the natural number 1,500, in a table of 
logarithms. Since the characteristic is 2, the natural num- 
ber desired must lie between 100 and 1,000, and must there- 
fore be 150. 

A brief study of the following figures, showing the pro- 
gression of numbers corresponding to certain powers of 10, 
will help to fix in mind the relations between the multiples 
of 10 and their logarithms, and will enable the characteristic 
of a desired logarithm to be readily determined. 

.0001 .001 .01 .1 1 10 100 1,000 10,000 

10~ 4 10~ 5 10~ 2 10 -' 10 ° 10 1 10 2 10 * 10 4 

The exponents of 10 in the lower row are the logarithms 
of the numbers in the upper row. 

It should be noted that the logarithms of all numbers 
from 0 to 1 are negative. Thus the logarithm of 0.845 is 
— 1 + .92686; this is written 9.92686 — 10. In covering 
the range of all positive natural numbers from zero to infin- 
ity, logarithms traverse all positive and negative values. 
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A negative natural number, therefore, can have neither a 
positive nor a negative logarithm. 

The advantage of thus expressing numbers as powers of 
10 lies in the fact that the ordinary arithmetic operations of 
multiplication, division, raising to powers and extracting 
roots arc greatly facilitated by this procedure. 

To multiply numbers, add their logarithms. The sum 
of the logarithms of the factors is the logarithm of their 
product. In general terms: 

a b X a c = a <h+c) . 

Specifically, putting a = 10, b — 2, c = 3: 

10 2 X 10* = (10 X 10) X (10 X 10 X 10) = 10 6 = 100,000 
100 X 1,000 = 100,000. 


To divide one number by another, subtract the loga- 
rithm of the latter from the logarithm of the former. The 

remainder is the logarithm of the desired quotient. 

In general terms: 

a b + a c = a (b ~ c) . 

Specifically, putting a = 10, b = 5, c = 2: 


10 5 10 2 = 


10 X 10 X 10 X 10 X 10 


10 X 10 


100,000 - 5 - 100 


10 3 = 1,000 

= 1 , 000 . 


To raise a given number to any power, multiply the 
logarithm of the number by the index of the power. The 

product is the logarithm of the desired power. 

In general terms: 

( a b y = a\ 

Specifically, putting a = 10, b = 3, c = 2: 

(10 s ) 2 = (10 X 10 X 10) X (10 X 10 X 10) = 10 6 = 1,000,000 

1 , 000 2 = 1 , 000 , 000 . 

To extract any root of a given number, divide the loga- 
rithm of the number by the index of the root. The quo- 
tient is the logarithm of the desired root. 
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In general terms: 

Specifically, putting a = 10, b = 3, c = 6: 

■^T0« = 10* = 102 = 100 

v 7 1 ,000,000 = 100. 

In summary: 

log (a X b) = log a + log 6 
log (a - 7 - 6) = log a — log b 
log a b = b X log a 
log v'a = log a -5- 6. 

These characteristic advantages of logarithms have been 
made use of in the construction of the slide rule, an instru- 
ment for reducing routine toil which should be familiar to 
all students of statistics. 

LOGARITHMIC EQUATIONS 

The graphic representation of data by means of a system 
of rectangular coordinates has been described above and 
some of the advantages of this method have been outlined. 
For many purposes it is desirable to plot logarithms rather 
than the natural numbers themselves. This may result in 
bringing out significant relations more distinctly, or it may 
serve greatly to simplify and facilitate tlx; manipulation of 
data. In particular, when it is possible through the use of 
logarithms to reduce a complex curve to the straight line 
form, a distinct gain has been made in the direction of 
simplicity of treatment and interpretation. 

A linear equation, it will be recalled, is of the general 
form y = a + bx, where a and b are constants which meas- 
ure, respectively, the //-intercept of the given line and the 
slope. The simplification of equations through the use of 
logarithms involves in all cases the substitution of log x or 
log y, or both, for the .r or y variables, thereby reducing an 
equation of a higher order to a simpler form. 
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This process may be illustrated with reference to the 
equation y = x 2 . When plotted on rectangular coordinates 
this equation gives a curve of the parabolic type (cf. Fig. 5). 


Natural Numbers 



Scale of Logarithms 


Fig. 9. - Graph of the Equation log?/ = 2 log a; (Logarithmic form of 
the equation y = x 2 ) 

Reduced to logarithmic form this becomes log y = 2 log x. 
This equation, in which the variables are log?/ and log x, 
is linear in form. It is plotted in Fig. 9, for positive values 
of log x. To indicate the relations involved, natural numbers 
corresponding to the logarithms are given on scales to the 



28 


STATISTICAL METHODS 


right and at the top of the figure. The natural numbers 
appearing on the scales constitute geometric series, while 
their logarithms form arithmetic series. Equal vertical 
distances on the chart, it will be noted, represent equal 
absolute increments on the scale of logarithms and equal 
percentage increments on the scale of natural numbers. 

The equation y = ox 3 can be reduced in the same way 
to log y = log 5 + 3 log x, a linear form. Similarly, all 
equations of the type y = ax b , that is to say, all simple 
parabolas and hyperbolas, can be reduced to the straight 
line form log y = log a + b log x. Graphically this means 
plotting the logarithms of the if s against the logarithms of 
the x’s. 

A different problem is presented by an equation of the 
type y = ab x , the graph of which is termed an exponential 
curve. Expressed in logarithmic form, we have log y = 
log a + x log b. This is also of the linear type, the two con- 
stants being log a and log b, while the variables are x 
and log y. If we plot the natural x’s and the logs of 
the if s, with this type of equation, a straight line will be 
secured. A curve of this type is discussed and illustrated 
below. 


LOGARITHMIC AND SEMI-LOGARITHMIC CHARTS 

There arc certain disadvantages to the plotting of loga- 
rithms, however. If a considerable number of points are 
being plotted the task of looking up the logarithms may be 
tedious, and, in addition, the original values, in which 
chief interest lies, will not appear on the chart. These 
difficulties may be avoided by constructing charts with the 
scales laid off logarithmically, but with the natural numbers 
instead of the logarithms appearing on the scales. This is 
an arrangement identical with that employed in the con- 
struction of slide rules. Thus, although the natural numbers 
are given on the scales, distances are proportional to the 
logarithms of the numbers thereon plotted. In Fig. 10 
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such a chart is presented, showing the graph of the equa- 
tion y = x 2 . 

A variation of this type of chart which is of great im- 
portance in statistical work is one which is scaled arith- 
metically on the horizontal axis and logarithmically on the 
vertical axis. This is equivalent, of course, to plotting the 



Fig. 10. — Graph of the Equation y — .r 2 (Plotted on paper with 
logarithmic scales) 

x’s on the natural scale and plotting the logarithms of the 
y’s. As was pointed out above, such a combination of 
scales reduces a curve of the exponential type to a straight 
line. Plotting paper of this semi-logarithmic or “ratio” 
type may be constructed with the aid of a slide rule or 
of logarithms, or may be purchased ready made. It is of 
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particular value in charting economic statistics, because of 
the fact that time is usually one of the variables in such 
cases, and it is desirable to plot this variable on the natural 
scale. 

Dollars 



Fig. 11. — The Compound Interest Law: Growth of $10.00 at Compound 
Interest at 6 per cent for 100 Years (Plotted on arithmetic scale) 


As an example of this type of curve the compound inter- 
est law may be used. If r be taken to represent the rate 
of interest, x the number of years, p the principal, and y 
the sum to which the principal amounts at the end of x 
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years (interest being compounded annually), an equation is 
secured of the form 

y = p(l + r) x . 

Expressed logarithmically this becomes 

log y = log p + x log (1 + r), 
the equation to a straight line. 



Years 

Fig. 12. — The Compound Interest Law: Growth of $10.00 at Com- 
pound Interest at 6 per cent for 100 Years (Plotted on semi-logarithmic 
or ratio scale) 

In Fig. 11 a curve representing the growth of ten dollars 
at compound interest at 6 per cent is plotted on the natural 
scale. This is the graph of the exponential equation 

y = 10(1 + .06) x 

y representing the total amount of principal and interest 
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at the end of x years. Figure 12 shows the same data plotted 
on semi-logarithmic paper, the exponential curve being re- 
duced to a straight line. 

The use of semi-logarithmic paper is not confined to 
cases in which an exponential curve is straightened out, 
for the significance of many types of data is most effectively 
brought out when charts of this type are used. These 
advantages are more fully explained below. 

. The Construction of Charts 

When the results of observations or statistical investi- 
gations have been secured in quantitative form, one of the 
first steps toward analysis and interpretation of the data 
is that of presenting these results graphically. Not only is 
such procedure of scientific value in paving the way for 
further investigation of relationships, but it serves an im- 
mediate practical purpose in visualizing the results. A visual 
stimulus opens up a far more direct path to our understand- 
ing and imagination than that afforded by the more recently 
developed processes of reasoning. The interpretation of a 
column of raw figures may be a difficult task; the same data 
in graphic form may tell a simple and easily understood story. 
For these reasons graphic methods of presentation have come 
to play a highly important part in the everyday activities 
of business, as well as in the laboratory and drafting room. 

It is beyond the scope of this book to present any detailed 
account of the multiplicity of graphs employed by engineers 
and statisticians today. Certain of the more important 
principles of graphic presentation may be briefly explained, 
however, and some of the chief types of graphs which are 
in daily use may be illustrated. Other examples appear in 
later chapters of this book. 

FACTORS GOVERNING THE SELECTION OF A CHART 

The selection of the type of chart to be employed in a 
given case will depend upon two general considerations. 
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The first of these relates to the character of the material 
to be plotted. While the data of a given problem may 
frequently be presented graphically in several different 
forms, there is generally one type of chart best adapted 
to that material. It may be true, also, that certain types 
would be quite inappropriate to the data in question. The 
selection of a type of chart to employ, therefore, must 
be made with the characteristics of the data clearly in 
mind. 

Perhaps more important is the purpose which the given 
chart is designed to serve. Each of the many types 
of charts in common use is appropriate to certain speci- 
fic purposes. It will bring out certain characteristics of 
the data or will emphasize certain relationships. There 
is no chart which is sovereign for all purposes. Until the 
purpose is clearly defined the best chart form cannot 
be selected. The following descriptions of a few stand- 
ard types will facilitate the selection of an appropriate 
form. 

CHARTS ADAPTED TO THE PLOTTING OF TIME SERIES 

In the graphic presentation of a time series, primary 
interest attaches to the chronological variations in the values 
of the data, to the general trend and to the fluctuations 
about the trend. If the purpose is to emphasize the absolute 
variations, the differences in absolute units between the 
values of the series at different times, a simple chart of the 
type illustrated in Fig. 13 will serve the purpose. This 
chart depicts annual wheat flour exports from the United 
States during the period 1913-1936. Both scales are arith- 
metic. Points representing the various annual values are 
shown and, to facilitate interpretation, these points are 
connected by a series of straight lines. The chart tells a 
simple story of year-to-year fluctuations, with a sharp ad- 
vance at the end of the World War, a decline as the post-war 
emergency passed, several years of moderate growth, and a 
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Fig. 13. — Wheat Flour Exports from the United States, 1913-1936 


severe decline as the world depression deepened in the early 
thirties. With respect to general make-up, the following 
points should be noted : 


1. The title constitutes a clear description of the material plotted 

and indicates the period covered. 

2. The vertical scale begins at the zero line, enabling a true 

impression to be gained of the magnitude of the fluctua- 
tions. 

3. The zero line and the line joining the plotted points are ruled 

more heavily than the coordinate lines. 

4. Figures for the scales are placed at the left and at the bottom 

of the chart. The vertical scale may be repeated at the 
right to facilitate reading. All figures are so placed that 
they may be read from the base as bottom or from the 
right hand edge of the chart as bottom. 


RATIO CHARTS 


5. The {/-values of the plotted points are given at the top of the 
chart. This practice is helpful, though not necessary, as 
the values may be presented in a separate table. 

ADVANTAGES OF THE RATIO CHART 

If relative rather than absolute variations are of chief 
concern, the chart employed should be of the semi-loga- 
rithmic type, scaled logarithmically on the y-axis and arith- 
metically on the z-axis. In such a chart equal percentage 
variations are represented by equal vertical distances, as 
opposed to the ordinary arithmetic type in which equal 
absolute variations are represented by equal vertical dis- 
tances. The argument for the use of the semi-logarithmic 
or ratio chart for the representation of time series is that, 
in general, the significance of a given change depends upon 
the magnitude of the base from which the change is meas- 



Fig. 14. — Production of Steel Ingots and Castings in the United States, 
1896-1936 (Plotted on semi-logarithmic scale) 


ured. That is, an increase of 100 on a base of 100 is as 
significant as an increase of 10,000 on a base of 10,000. In 
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each case there is an increase of 100 per cent. The absolute in- 
crease in the second case is 100 times that in the first case, 
and the two changes would show in this proportion on the 
arithmetic chart. They would show as of equal importance 
on the semi-logarithmic chart. 

Such a chart is presented in Fig. 14, which shows the 
course of steel production in the United States from 1896 
to 1936. The absolute magnitudes are plotted, but the 
vertical scale is so constructed as to represent variations 
from year to year in proportion to their relative magnitude. 



Fia. 15. — Exports of the United States, 1920-1936 Showing Total Ex- 
ports and Exports to Selected Areas (Monthly averages for the years 
named are plotted on an arithmetic scale) 


Certain distinctive advantages of the ratio or logarithmic 
ruling are brought out by a comparison of Fig. 15 and 
Fig. 16. The data presented graphically in these two charts 
are shown in Table 2: 
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Table 2 


Exports of the United States, 1920-1936 
(Monthly averages, in thousands of dollars) - 



To 

To 

To 

To 

Total to 

Grand 


France 

Germany 

Italy 

u nitea 
Kingdom 

Europe 

total 

1920 

$56,349 

$25,953 

$30,980 

$161,319 

$372,174 

$685,668 

1921 

18,745 

31,027 

17,955 

78,510 

196,992 

373,753 

1922 

22,247 

26,343 

12,575 

71,319 

173,613 

319,315 

1923 

22,678 

26,403 

13,961 

73,527 

174,451 

347,291 

1924 

23,472 

36,702 

15,595 

81,912 

203,775 

382,582 

1925 

23,358 

39,195 

17,096 

86,155 

216,979 

409,154 

1926 

22,000 

30,347 

13,117 

81,051 

192,512 

400,722 

1927 

19,065 

40,140 

10,971 

70,005 

192,576 

405,448 

1928 

20,058 

38,938 

13,510 

70,613 

197,912 

427,363 

1929 

22,133 

34,204 

12,831 

70,667 

195,070 

435,083 

1930 

18,663 

23,189 

8,369 

56,509 

153,198 

320,265 

1931 

10,152 

13,838 

4,568 

37,923 

95,040 

202,024 

1932 

9,297 

11,139 

4,095 

24,027 

65,358 

134,251 

1933 

10,143 

11,669 

5,103 

25,978 

70,815 

139,583 

1934 

9,642 

9,062 

5,381 

31,896 

79,150 

177,733 

1935 

9,751 

7,665 

6,035 

36,117 

85,770 

190,240 

1936 

10,795 

8,382 

4,900 

36,662 

86,694 

204,457 


(Data compiled by Bureau of Foreign and Domestic Commerce, U. 8. 
Department of Commerce.) 


If the six series are to be presented on a single chart, 
scaled arithmetically, a scale must be selected which will 
include the largest item recorded, $685,668,000. Such a 
scale reduces the relative importance of the smaller magni- 
tudes. From Fig. 15 it appears that during the period cov- 
ered by the chart very large fluctuations occurred in total 
exports, substantial but somewhat smaller movements oc- 
curred in exports to Europe, and that exports to the four 
individual countries suffered much less severe fluctuations. 
Such a picture is quite misleading. The true state of affairs 
is reflected in Fig. 16, in which the same data are plotted 
on paper with a semi-logarithmic ruling. Fluctuations in 
exports to the individual countries are here seen to have 
been relatively greater than the movements of total exports. 
For the purpose of comparing series which differ materially 



Fig. 16. — Exports of the United States, 1920-1936, Showing Total 
Exports and Exports to Selected Areas, with Scales of Increase, 
Decrease, and Comparison (Monthly averages for the years named 
are plotted on a semi-logarithmic scale) 
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with respect to the magnitude of the individual items, the 
arithmetic ruling is quite useless, giving a thoroughly dis- 
torted picture of the true relations. The ratio ruling permits 
a legitimate comparison. 

The scales printed below Fig. 16 emphasize certain very 
useful features of the logarithmic ruling. The scale of in- 
crease may be used to measure with a fair degree of accu- 
racy the increase in a given series between any two dates. 
A given vertical distance on the chart, it will be recalled, 
represents a constant percentage increase at all points on 
the chart. Thus the distance from 1 to 10, along the vertical 
scale, is the same as the distance from 100 to 1,000. Any 
vertical distance may be measured, and the percentage of 
increase which it represents may be determined by laying 
off the given distance along the scale of increase, which is 
always read from the bottom up. For example, to determine 
the degree of increase in total exports from 1932 to 1935, 
we measure the vertical distance between the points plotted 
for these two years. Laying off this distance along the scale, 
it is found to represent about a 40 per cent increase. 

The scale of decrease is used in a similar fashion. The 
vertical distance between any two points is measured, and 
the percentage decrease which it represents is determined 
by laying off the given distance on the scale from the top 
downward. The arrows indicate the direction in which the 
various scales are to be read. 

By means of the scale of comparison the percentage relation 
of one series to another at any time may be determined. 
For example, we may wish to know the percentage relation 
between exports to Europe and total exports in 1935. The 
vertical distance between the two plotted points is measured, 
and laid off on the scale of comparison, reading from the top 
downward. It is found to be approximately 45 per cent. 

Scales of the type illustrated above may be readily con- 
structed on a given chart by using the ratio ruling for the 
scale intervals. When a series of charts is prepared on 
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semi-logarithmic paper of a standard type it is convenient 
to construct such scales in a more permanent form, in the 
shape of special rulers. 

A ratio chart is particularly useful when interest attaches 
to rates of growth (or decline) over a considerable period of 
time. In such a case, the reading of the chart is facilitated 
by the plotting of straight diagonal lines indicating uniform 



Fio. 17. — Production of Rayon Filament Yarn in the United States, 
1912-1937, With Lines Defining Uniform Rates of Growth 


rates of change. These should radiate from a single point of 
origin. The procedure is illustrated in Fig. 17. The diagonal 
lines there shown indicate changes at uniform rates ranging 
from 10 per cent to 50 per cent per year. By reference to 
these lines the user of the chart may readily determine the 
approximate rate of growth of the plotted series between 
any two years. 

The chief advantages of the semi-logarithmic ruling in 
chart construction may be briefly summarized: 

1. A curve of the exponential type becomes a straight line when 
plotted on a semi-logarithmic chart. For example, a curve 
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representing the growth of any sum of money at compound 
interest takes the form of a straight line when so plotted. 

2. In any series, so long as the rate of increase or decrease remains 

constant the graph will be a straight line on this ruling. 

3. Equal relative changes are represented by lines having equal 

slopes. Thus two series increasing or decreasing at equal 
rates will be represented by parallel lines. 

4. Comparison of the rates of change in two or more series is 

effected by comparison of the slopes of the plotted lines. 

5. The semi-logarithmic ruling permits, at the same time, the 

plotting of absolute magnitudes and the comparison of 
relative changes. 

6. Comparison of series differing materially in the magnitude of 

individual items is possible with the semi-logarithmic chart. 

7. Percentages of change may be read and percentage relations 

between magnitudes determined directly from the chart. 


CHARTS FOR THE COMPARISON OF FREQUENCIES 

A different type of chart is called for when the object is 
the comparison of frequencies, that is, numbers of events 
or things of different classes. The following census figures 
may serve to illustrate the problem. 


Table 3 


Farms in New England States in 1935 


State 

Maine 

New Hampshire 
Vermont 
Massachusetts 
Rhode Island 
Connecticut 


Number of farms 
41,907 
17,695 
27,061 
35,094 
4,327 
32,157 


A graphic comparison of these six states with respect to num- 
ber of farms in 1935 is afforded by the bar diagram in Fig. 18. 
This is a simple but effective type of chart for this purpose. 

Further examples of this type of chart, as employed in 
the representation of frequency distributions, are contained 
in the next chapter. It is there shown how a frequency 
polygon or frequency curve may grow out of the simple 
bar diagram, when data of certain kinds are being handled. 
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Fig. 18. — Farms in New England States in 1935 


Such frequency curves constitute very important graphic 
types, but it will be more appropriate to treat them in 
full at a later point. 

CHARTS FOR THE REPRESENTATION OF COMPONENT PARTS 

It is frequently desirable in tabular and graphic presenta- 
tion to break up a total into its component parts, in order 
that changes in the parts as well as in the total may be fol- 
lowed. The table on page 43 exemplifies this procedure. 

These figures are presented graphically in Fig. 19, which 
reveals the varying post-war fortunes of different interests 
in American manufacturing industries. It is clear from the 
diagram that the general swings of material costs, labor costs 
and overhead costs in American manufacturing industries 
have paralleled the fluctuations in total value of products. 
Some of the movements of the component items are of ex- 
ceptional interest, however. Overhead costs (with which 
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Table 4 

Total Value of Products and Elements of Production Costs , 
Manufacturing Industries of the United States , 

1919-1935 


(Millions of dollars) 


Year 

Cost of 

Labor cost 

Overhead, cost 

Total value 

materials ' 

{wages) 

plus profits 1 2 

of products 

1919 

$37,233 

$10,462 

$14,347 

$62,042 

1921 

25,321 

8,202 

10,130 

43,653 

1923 

34,706 

11,009 

14,841 

60,556 

1925 

35,936 

10,730 

16,048 

62,714 

1927 

34,803 

10,836 

16,639 

62,278 

1929 

38,178 

11,607 

20,176 

69,961 

1931 

21,681 

7,173 

12,184 

41,038 

1933 

16,821 

5,262 

9,276 

31,359 

1935 

26,264 

7,545 

11,951 

45,760 



Fig. 19. — Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 1919-1935 


1 Including containers, fuel and purchased electric energy. 

J This item represents the difference between total direct costs (materials 
and wages) and total value of products. It includes overhead costs proper, 
plus salaries, taxes, profits, etc. 
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profits are here combined) showed a notable expansion be- 
tween 1921 and 1929. The great recession that followed 
squeezed all the elements of the total, forcing them to levels 
well below those of the 1921 depression. 


CUMULATIVE CHARTS 

In many cases chief interest in the development of a 
series attaches not to the value of each successive item but 
to the cumulated total of a number of such items. This 
may be so when a yearly production program has been 
laid out. In such a case it is the relation between cumu- 
lated production to date and scheduled production to date 
which is of major interest, and a chart form is needed 
which will enable this comparison to be made. The fol- 
lowing figures illustrate the type of data for which such 
charts are appropriate. 


Table 5 


Cumulative Production Schedule and Cumulative 
Output, 1936 

(Speedwell Automobile Company) 


Month 

Production 

schedule 

(cars) 

Cumulative 

production 

schedule 

(cars) 

Output 

(cars) 

Cumulative 

output 

(cars) 

January 

8,000 

8,000 

6,125 

6,125 

February 

10,000 

18,000 

9,250 

15,375 

March 

12,000 

30.000 

45.000 

10,514 

25,889 

April 

15,000 

15,131 

41,020 

May 

14,000 

59,000 

12,159 

53,179 

June 

12,000 

71,000 

13,250 

66,429 

July 

11,000 

82,000 

11,462 

77,891 

August 

10,000 

92,000 

10,531 

4,621 

88,422 

September 

6,000 

98,000 

93,043 

October 

9,000 

107,000 

9,843 

102,886 

November 

10,000 

117,000 

13,785 

116,671 

December 

10,000 

127,000 



It is assumed that this table represents the situation as 
of the end of November. 
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In Fig. 20 the two cumulative curves are plotted. The 
relation between actual and scheduled production at the 
end of each month is shown on the chart, and it is possible 
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Fig. 20. — Comparison of Scheduled and Actual Output (Cumulative) 
Speedwell Automobile Co. 1936 


from the scale to read the approximate amount by which 
production is behind schedule. By reference to the figures, 
which should always accompany the chart, the exact rela- 
tion may be determined. Such a chart has many ap- 
plications, some of which are illustrated in the following 
chapter. 
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THE GANTT PROGRESS CHART 

The flflinft data may be presented in a very effective form 
by makin g use of a type of chart developed by Mr. H. L. 
Gantt. An adequate description of this chart and of its 
many uses would far exceed the space which can be given 
to it here, but its characteristics may be indicated in a very 
brief account. 

Once a schedule has been drawn up, the Gantt chart 
may be utilized in checking actual accomplishment against 
the schedule. Having such a schedule as that given in 
Table 5, the monthly and annual quotas may be entered 
on a form similar to that shown in Fig. 21. The entry to 
the left of each monthly space indicates the amount sched- 
uled for production during that month. The entry to the 
right of each monthly space indicates the cumulated sched- 
uled production to the end of the given month. In this 
figure the results of the first two months’ operations are 
shown. The heavy black line indicates the cumulated 
actual production during this period, amounting to 15,375 
cars. The narrow upper lines in the January and February 
columns measure the actual production in each of those 
months. If actual production in either month had equaled 
the scheduled production the light line would extend across 
the full monthly space. When actual production in a given 
month exceeds the scheduled production a double light line 
appears. 

It should be noted that the spaces into which each monthly 
period is divided represent equal parts of a month but vary- 
ing amounts in terms of actual production. Thus the space 
representing one fifth of the January interval represents a 
production of 1,600 cars (the January quota being 8,000). 
The space representing one fifth of the April interval repre- 
sents 3,000 cars (the April quota being 15,000). In reading 
the chart in terms of absolute magnitudes reference must be 
had to the monthly quotas. 
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Fig. 22. — Comparison of Scheduled and Actual Output, 1936: Gantt Progress Chart (Showing the situation on 

November 30th) 
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The situation at the end of November is shown in Fig. 22. 
The arrow at the top of the diagram indicates the point of 
time actually reached. That actual production is slightly 
behind scheduled production is apparent from the relation 
between this arrow and the heavy black line, while the light 
lines indicating monthly production show that actual output 
has exceeded the monthly quota in five of the last six 
months. 

The Gantt chart has a great variety of applications in 
governmental and business organizations. The economy 
of space is such that developments in a number of depart- 
ments or districts may be shown on a single chart. It 
constitutes the simplest and most effective graphic method 
known for following the progress of work under way, for 
comparing actual accomplishment with an established pro- 
gram. And in so doing, it increases by so much the efficiency 
of administrative control. 

Preferred Practice for Graphic Presentation 

Graphic methods have been widely employed in the physi- 
cal and social sciences and in business, and the resulting 
diversity of uses has made it difficult to secure standardiza- 
tion of practice. To remedy this defect a committee repre- 
senting engineering, statistical and research organizations 
was organized in 1929, under the sponsorship of the American 
Society of Mechanical Engineers, for the purpose of formulat- 
ing principles of preferred practice in this field. This group, 
acting as a Sectional Committee of the American Standards 
Association, is compiling a code of preferred practice for 
graphic presentation. The first section of this code, dealing 
with time series charts, has been issued by the sectional com- 
mittee. This report furnishes an excellent summary of con- 
ventional procedures, with detailed recommendations con- 
cerning principles appropriate to the graphic presentation of 
t.inrip. series. Somewhat more specialized, although it deals 
with certain principles applicable to the entire field of 
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graphics, is another report of the same committee on charts 
suitable for use as lantern slides . 1 
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pared by Subcommittee on Preferred Practice in Graphic Presentation, A. H. 
Richardson, Chairman, of the Sectional Committee on Standards for Graphic 
Presentation under Procedure of American Standards Association. Sponsored 
by the American Society of Mechanical Engineers, 29 West 39th Street, 
New York. 68 pages. (Preliminary Report.) 

American Recommended Practice, Engineering and Scientific Charts for 
Lantern Slides. Prepared by Subcommittee on Engineering and Scientific 
Charts, Walter A. Shewhart, Chairman, of the Sectional Committee on Stand- 
ards for Graphic Presentation. American Society of Mechanical Engineers, 
New York. 10 pages. 



CHAPTER III 


THE ORGANIZATION OF STATISTICAL DATA: 
THE FREQUENCY DISTRIBUTION 

The task of the statistician engaged in business or eco- 
nomic research includes the organization, analysis and in- 
terpretation of quantitative data relating to business affairs 
and to economic conditions. To these fundamental opera- 
tions that of collecting the original data may be added, 
though more frequently data will be compiled directly from 
primary or secondary sources. 

At the outset it is necessary to distinguish between the 
problems arising in the analysis of time series and those 
involved in the organization and analysis of materials in 
connection with which the time factor does not enter. In 
studying a time series the primary object is to measure and 
analyze the chronological variations in the value of the 
variable. Thus one may study variations in sales over a 
period of years, fluctuations in the production of bituminous 
coal, or changes in the general level of prices. Quite differ- 
ent is the procedure in the study of such a problem as 
income distribution at a given time. In this case we are 
desirous of knowing how many people in the United States 
fall in each of a number of income classes. The general 
problem of organization in this latter class of cases is to 
determine how many times each value of a variable is re- 
peated and how these values are distributed. Data of this 
sort, when organized, constitute a frequency series, as opposed 
to the time or historical series. The methods appropriate 
to these two types of analysis differ fundamentally and will 
therefore be treated separately. In the present section we 
are concerned with the organization and preliminary analy- 
st) 
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sis of data in connection with which the time element, 
while it may be present, does not enter as a factor. 

Unorganized Data 

When quantitative data of the type with which the 
statistician works are presented in a raw state they appear 
as unorganized masses of material, without form or struc- 
ture. They may have been drawn from the production or 
sales records of a business establishment, or they may 
represent a miscellaneous collection of price quotations. If 
the data have been gathered by other agencies they may 
already have been arranged in the form of a general table, 
but this form may be entirely unsuited to the particular 
object in the mind of the investigator. The first task of 
the statistician is the organization of the figures in such 
a form that their significance, for the purpose in hand, may 
be appreciated, that comparison with masses of similar 
data may be facilitated, and that further analysis may be 
possible. Scientific method, it has been noted, involves 
observation, inference, and verification. Data, the results of 
observation, must be put into definite form and given 
coherent structure before the process of inference is possi- 
ble. 

The figures on page 52, representing the earnings during 
a given week of 210 individuals engaged in piece work in a 
certain manufacturing establishment, will serve as an ex- 
ample of such data in their raw state. 

The Array 

If these figures are arranged in order of magnitude some- 
thing will have been done toward securing a coherent 
structure. The range covered and the general distribution 
throughout this range will then be clear, and the way will 
be prepared for further organization. When so arranged 
the array on page 53 is secured. 
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Weekly Earnings op 210 Employees 


$26.25 

$28.70 

$24.15 

$29.75 

$29.20 

$30.60 

$23.40 

$24.75 

26.70 

24.35 

25.75 

27.20 

28.30 

25.25 

27.75 

27.60 

28.20 

27.30 

27.80 

26.35 

27.40 

28.30 

26.60 

25.75 

27.70 

28.60 

25.30 

27.80 

26.40 

27.30 

28.35 

27.00 

24.30 

27.80 

27.60 

26.30 

27.40 

23.50 

29.60 

27.80 

27.60 

25.35 

27.55 

29.00 

24.10 

27.00 

24.50 

27.25 

26.15 

29.30 

23.10 

27.10 

28.50 

27.45 

26.15 

28.35 

27.95 

25.55 

27.55 

26.60 

24.25 

30.00 

28.55 

28.00 

27.30 

27.90 

25.25 

24.10 

27.45 

24.55 

26.55 

27.55 

26.75 

31.00 

24.00 

25.35 

26.50 

28.30 

27.95 

25.55 

30.25 

28.55 

26.75 

24.60 

25.75 

26.55 

27.80 

28.90 

29.55 

30.00 

24.60 

25.75 

26.30 

27.00 

28.25 

25.25 

25.75 

26.25 

26.30 

26.75 

27.90 

28.30 

25.70 

26.30 

26.60 

27.00 

30.75 

28.60 

28.10 

23.50 

24.75 

25.15 

26.30 

27.25 

28.15 

29.10 

30.10 

29.90 

28.55 

27.30 

26.55 

27.55 

23.00 

24.50 

22.85 

26.55 

27.55 

28.10 

30.70 

28.60 

27.90 

26.80 

24.10 

25.25 

26.30 

27.90 

26.90 

25.30 

25.80 

28.85 

27.55 

27.30 

25.00 

26.00 

26.55 

27.80 

28.60 

30.55 

29.50 

24.10 

25.15 

27.15 

28.10 

26.30 

27.10 

24.60 

27.80 

26.30 

27.90 

29.80 

24.10 

25.15 

27.50 

24.25 

25.70 

26.80 

30.15 

29.30 

28.15 

28.65 

24.55 

25.85 

26.10 

27.00 

26.80 

27.55 

29.00 

23.00 

28.60 

29.30 

28.55 

28.80 

27.55 

23.60 

26.10 

27.15 

25.75 

26:80 

27.15 

26.30 

28.55 

25.80 

24.55 

25.80 

26.75 

27.30 

27.55 

28.25 

25.60 

26.30 

26 85 
28. GO 

27.30 

26.00 

28.10 

32.00 

28.15 

26.30 

27.75 

26.25 


Frequency Tables 

While the array presents the figures in a shape much 
more suitable for study than the haphazard distribution 
first shown, there is still something to be desired before the 
mind can readily grasp the full significance of the data. 
The factory manager may see that the smallest amount 
earned during the week was $22 . 85, that the largest amount 
earned was $32 . 00, and that most of the employees earned 
between $25 . 00 and $29 . 00, but this is still a vague descrip- 
tion of the data. By a process of grouping, that is, by 
putting into common classes all individuals whose earnings 
fall within certain limits, a simplified and more compact 
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Array: Weekly Earnings of 210 Employees 


$22.85 

23.00 

23.00 

23.10 

23.40 

23.50 

23.50 

23.60 

24.00 

24.10 

24.10 

24.10 

24.10 

24.10 

24.15 

24.25 

24.25 

24.30 

24.35 

24.50 

24.50 

24.55 

24.55 

24.55 

24.60 

24.60 

24.60 

24.75 

24.75 

25.00 


$25.15 

25.15 

25.15 

25.25 

25.25 

25.25 

25.25 

25.30 

25.30 

25.35 

25.35 

25.55 

25.55 

25.60 

25.70 

25.70 

25.75 

25.75 

25.75 

25.75 

25.75 

25.75 

25.80 

25.80 

25.80 

25.85 

26.00 

26.00 

26.10 

26.10 


$26.15 

26.15 

26.25 

26.25 

26.25 

26.30 

26.30 

26.30 

26.30 

26.30 

26.30 

26.30 

26.30 

26.30 

26.30 

26.30 

26.35 

26.40 

26.50 

26.55 

26.55 

26.55 

26.55 

26.55 

26.60 

26.60 

26.60 

26.70 

26.75 

26.75 


$26.75 

26.75 

26.80 

26.80 

26.80 

26.80 

26.85 

26.90 

27.00 

27.00 

27.00 

27.00 

27.00 

27.10 

27.10 

27.15 

27.15 

27.15 

27.20 

27.25 

27.25 

27.30 

27.30 

27.30 

27.30 

27.30 

27.30 

27.30 

27.40 

27.40 


$27.45 

27.45 

27.50 

27.55 

27.55 

27.55 

27.55 

27.55 

27.55 

27.55 

27.55 

27.55 

27.60 

27.60 

27.60 

27.70 

27.75 

27.75 

27.80 

27.80 

27.80 

27.80 

27.80 

27.80 

27.80 

27.90 

27.90 

27.90 

27.90 

27.90 


$27.95 

27.95 

28.00 

28.10 

28.10 

28.10 

28.10 

28.15 

28.15 

28.15 

28.20 

28.25 

28.25 

28.30 

28.30 

28.30 

28.30 

28.35 

28.35 

28.50 

28.55 

28.55 

28.55 

28.55 

28.55 

28.60 

28.60 

28.60 

28.60 

28.60 


$28.60 

28.65 

28.70 

28.80 

28.85 

28.90 

29.00 

29.00 

29.10 

29.20 

29.30 

29.30 

29.30 

29.50 

29.55 

29.60 

29.75 

29.80 

29.90 

30.00 

30.00 

30.10 

30.15 

30.25 

30.55 

30.60 

30.70 

30.75 

31.00 

32.00 


presentation of the wage distribution may be obtained. The 
following table shows the results of this grouping process 
when the range of each class (the class-interval) is two 
dollars. 

This table presents a condensed summary of the original 
figures, a summary which not only gives us the approximate 
range of the earnings, but shows, also, how the earnings of 
the 210 workers are distributed throughout this range. 
There has been a considerable loss of detail, it will be 
noted. 
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Table 6 

Frequency Distribution of Employees 
(Classified on the basis of weekly earnings [class-interval =» $2]) 


Weekly earnings 

Number earning stated amount 

(frequency) 

$22.00 to $23.99 

8 

24.00 to 25.99 

48 

26.00 to 27.99 

96 

28.00 to 29.99 

47 

30.00 to 31.99 

10 

32.00 to 33.99 

1 

210 

From this table we may learn that there are 48 persons who 


earned during the given week between $24.00 and $25.99, 
but we cannot learn how the earnings of the 48 individuals 
were distributed throughout this range of two dollars. All 
may have earned exactly $24.00, so far as we may know 
from the figures shown in the table. This loss of detail is 
an inevitable accompaniment of the condensation and sim- 
plification which the process of classification involves. 

If the size of the class-interval be decreased the loss of 
detail is less pronounced, though the increase in the number 
of classes means a more cumbersome table and one which 
presents a more complex picture to the eye. The tables 
on page 55 present the same data, classified with intervals 
of one dollar, fifty cents, and twenty-five cents. 

The four tables we have thus constructed represent four 
different degrees of condensation of the same data. Tables 6, 
7, and 8 present the same general characteristics: a small 
number of cases in the extreme classes and a more or less 
regular increase in the frequencies as the center of each of 
the distributions is approached. The departure from reg- 
ularity becomes greater the greater the number of classes. 
Table 9, in which the class-interval is 25 cents, has 38 classes. 
In this table the distribution of cases throughout the range 
is highly irregular, with pronounced departures from sym- 
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Frequency Distributions of Employees 
(C lassified on the basis of weekly earnings) 


Table 7 
(Class-interval 

= $1) 

Table 8 

(Class-interval = 50 cents) 

Table 9 

(Class-interval = 25 cents) 

Weekly 

Fre - 

Weekly 


Weekly 

Fre- 

earnings 

quency 

earnings 


earnings 

quency 

$22.00 to $22.99 

1 

$22.50 to $22.99 

i 

$22.75 to $22.99 

1 

23.00 to 23.99 

7 

23.00 to 23.49 

4 

23.00 to 23.24 

3 

24.00 to 24.99 

21 

23.50 to 23.99 

3 

23.25 to 23.49 

1 

25.00 to 25.99 

27 

24.00 to 24.49 

11 

23.50 to 23.74 

3 

26.00 to 26.99 

42 

24.50 to 24.99 

10 

23.75 to 23.99 

0 

27.00 to 27.99 

54 

25.00 to 25.49 

12 

24.00 to 24.24 

7 

28.00 to 28.99 

34 

25.50 to 25.99 

15 

24.25 to 24.49 

4 

29.00 to 29.99 

13 

26.00 to 26.49 

22 

24.50 to 24.74 

8 

30.00 to 30.99 

9 

26.50 to 26.99 

20 

24.75 to 24.99 

2 

31.00 to 31.99 

1 

27.00 to 27.49 

24 

25.00 to 25.24 

4 

32.00 to 32.99 

1 

27.50 to 27.99 

30 

25.25 to 25.49 

8 


Krrrfl 

28.00 to 28.49 

17 

25.60 to 25.74 

5 



28.50 to 28.99 

17 

25.75 to 25.99 

10 



29.00 to 29.49 

7 

26.00 to 26.24 

6 



29.50 to 29.99 

6 

26.25 to 26.49 

16 



30.00 to 30.49 

5 

26.50 to 26.74 

10 



30.50 to 30.99 

4 

26.75 to 26.99 

10 



31.00 to 31.49 

1 

27.00 to 27.24 

11 



31.50 to 31.99 

0 

27.25 to 27.49 

13 



32.00 to 32.49 

1 

27.50 to 27.74 

14 




210 

27.75 to 27.99 

16 





28.00 to 28.24 

9 





28.25 to 28.49 

8 





28.50 to 28.74 

14 





28.75 to 28.99 

3 





29.00 to 29.24 

4 





29.25 to 29.49 

3 





29.50 to 29.74 

3 





29.75 to 29.99 

3 





30.00 to 30.24 

4 





30.25 to 30.49 

1 





30.50 to 30.74 

3 





30.75 to 30.99 

1 





31.00 to 31.24 

1 





31.25 to 31.49 

0 





31.50 to 31.74 

0 





31.75 to 31.99 

0 





32.00 to 32.24 

1 






210 


metry. The structure of each of the other tables is orderly 
and approaches more closely a condition of symmetry. Each 
presents the wage data in condensed and compact form, so 
that one consulting the tables may learn of the size and 
distribution of weekly earnings in the factory in question 
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much more readily than by reference to the chaotic collec- 
tion of figures first shown. Such organized collections of 
data are termed frequency distributions, and their purpose, 
as the term implies, is to show in a condensed form the na- 
ture of the distribution of a variable quantity throughout 
the range covered by the values of the variable. The con- 
struction of such a table is the first step to be taken in the 
organization and analysis of quantitative data of the type 
represented above. 

STEPS IN THE CONSTRUCTION OF A FREQUENCY TABLE 

This general introduction to the subject of frequency 
tables has left untouched many important matters in con- 
nection with their construction. It remains to present a 
summary statement of these details. It will be clear that 
the first step here taken, the arrangement of the items in 
order of magnitude, is unnecessary in the actual construc- 
tion of such a table. Having determined the upper and 
lower limits through an inspection of the data, one has 
but to decide on the number of classes desired, write the 
class-intervals on an appropriate blank sheet, and proceed 
to tally the cases falling in each of the classes thus set off. 
When this process is completed the frequencies are com- 
puted and the totals arranged in tabular form of the type 
illustrated above. These simple operations involve decisions 
on a number of points, however. 

SIZE OF CLASS-INTERVAL 

In deciding upon the size of the class-interval (which is 
equivalent to deciding upon the number of classes) one 
fundamental consideration should be borne in mind, namely, 
that classes should be so arranged that there will be no 
material departure from an even distribution of cases within 
each class. This arrangement is necessary because, in inter- 
preting the frequency table and in subsequent calculations 
based upon it, the mid-value of each class is taken to repre- 
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sent the values of all cases falling in that class. Thus, in 
basing calculations upon Table 8, it is assumed that the 
22 cases falling between $26 . 00 and $26 . 50 may .all be repre- 
sented by the mid-value of that class, $26 . 25. This assump- 
tion will seldom be strictly valid. In the case just cited 
reference to the original figures will show that it is not a 
correct assumption. Absolute accuracy would only be ob- 
tained by having a class for every value represented in the 
original figures. Since condensation is necessary an arrange- 
ment of classes should be secured which will minimize the 
error involved, without transgressing other requirements. 
Table 6 furnishes an example of class-intervals too wide 
for the material. 

The requirement which has just been described clearly 
calls for a large number of classes. A second requirement, 
which ordinarily conflicts with this, is that the number of 
classes should be so determined that an orderly and regular 
sequence of frequencies is secured. If the classification is 
too narrow for the data regularity will not be attained in 
this respect, and a table without structure or order will be 
secured. Table 9 fails to meet this requirement, as has been 
pointed out. It is desirable, also, that the number of classes 
be limited in order that the data may be easily manipulated 
and their significance readily grasped. 

A useful procedure for approximating a suitable class- 
interval has been suggested by H. A. Sturges. Given a 
series of N items of known range, a suitable class-interval i 
may be approximated from the formula 

. _ Range 
* ~ 1 + 3.322 log N 

The specific figure secured in a given instance is likely to 
be a fractional value, quite unsuited to actual use. An 
appropriate round number close to the theoretical value, 
may be chosen. 1 Thus, in the example cited above, with a 

1 This formula, and the justification for its use, are discussed in “The Choice 
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range of $9 . 15 and N equal to 210, the use of a class-interval 
of $1.05 is indicated by the formula. The nearest round 
number, suitable with reference to other considerations as 
well, is $1 . 00. Table 7, in which this class-interval is em- 
ployed, seems to conform most thoroughly to all the re- 
quirements we have set forth. 

LOCATION OF CLASS LIMITS 

The location of class limits is a matter of considerable 
importance, for attention to this matter will simplify tab- 
ulation and facilitate later calculation. Tabulation of data 
is easiest when class limits are integers and the class-interval 
itself is a whole number. Calculation of averages and other 
statistical measures is facilitated when the mid- values of 
classes are integers. Suitable class limits and mid-points 
are usually secured when the data permit class-intervals of 5 
or multiples of 5 to be employed, though such an arrange- 
ment is by no means essential. 

Some types of data show a tendency to cluster or con- 
centrate about certain values on the scale along which they 
are distributed. This is illustrated by the following figures 
which form part of a table showing the number of pieces 
of commercial paper discounted by the Federal Reserve 
Banks in 1921, distributed accordihg to rates of discount 
or interest charged by member banks : 


( per cent) 

Number of pieces 

6 

18,970 

6* 

697 

6* 

4,616 

6f 

135 

7 

17,362 

7i 

10 


of a Class Interval” by Herbert A. Sturges, Journal of the American Statistical 
Association, March, 1926, 65-6. The use of the formula rests on the assumption 
that the proper distribution into classes is given, for all numbers which are 
powers of 2, by a series of binomial coefficients. The relation of the terms in 
the binomial expansion to the theory of frequency distributions is discussed 
below, in Chapter XIII. 
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Here is a quite obvious bunching about the integers, with 
a secondary concentration at each half of one per cent. 
No cases at all fall between the quarter values here shown. 
It is clear that in classifying such data the mid-points of 
the various classes should fall at those values about which 
the cases are concentrated, and class limits must be located 
with this end in view. For, as noted above, calculations 
based upon the frequency table are performed upon the 
assumption that all the items in each class are concentrated 
at the mid-point of that class. Thus, if a class interval of 
one half of one per cent were selected in the above example, 
the classes should extend from 5f to (but not including) 
61, 61 to 61, etc., rather than from 6 to 61, 61 to 7, etc. 


ACCURACY OF OBSERVATIONS AND THE DEFINITION 
OF CLASSES 

In the construction of frequency tables it is essential 
that there be a clear definition of classes, so that there may 
be no uncertainty as to their range and no question as to 
the precise class in which a given case falls. A table with 
an arrangement similar to the following is sometimes en- 
countered: 


Class-interval 


Frequency 


0 to 10 3 

10 to 20 8 

20 to 30 15 

30 to 40 6 

40 to 50 2 


In the absence of explanation, a question arises at once as 
to whether a case with a value of 10 would fall in the first 
or in the second class. It is highly desirable that the range 
of each class be indicated in some such way as the following, 
in order that this ambiguity may be avoided: 
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Class-interval Frequency 

Oto 9.9 3 

10 to 19.9 8 

20 to 29.9 15 

30 to 39.9 6 

40 to 49.9 2 

This procedure solves the difficulty, however, only in case 
the observations are accurate to the nearest tenth. If the 
observations are accurate only to the nearest unit (that is, 
if the cases recorded as having a value of 10 actually lie 
between 9 . 5 and 10.5) a mere change in the description of 
the class range does not solve the problem of allocating a 
case at the class limit. In such a case an observation falling 
at a class boundary may be cut in two, one half being allo- 
cated to each of the adjacent classes. 

Yule lays down the useful principle that in fixing a class 
boundary the limit should be carried to a farther place in 
decimals, or a smaller fraction, than the values of the indi- 
vidual cases as originally recorded. Thus, in the preceding 
example, if observations were correct to the nearest tenth, 
it would mean that a value recorded as 9.9 actually lay 
between 9 . 85 and 9 . 95. In accurately describing the classes, 
therefore, the intervals should be given as 0 to 9.95, 9.95 
to 19.95, etc. (Since the observations to be tabulated are 
recorded only to the first decimal place no ambiguity arises 
from the apparent over-lapping of these class limits.) It 
should be noted that the values of the mid-points, with 
these class limits, would be 4.95, 14.95, etc. In presenting 
and using the table as given above the real meaning of the 
class limits should be borne in mind. In all cases class 
boundaries must be fixed with reference to the accuracy of 
the observations. 

The work of tabulation is simplified if, in designating a 
class, both limits are stated, as above. Errors are likely if 
only the lower limit of each class is given, or if the mid- 
point alone is designated. It is desirable, however, par- 
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ticularly if calculations are to be based upon the table, to 
include a separate column showing the values of the mid- 
points of the various classes. 

OTHER REQUIREMENTS 

Class-intervals should be uniform throughout the table 
in order that all classes may be comparable. Occasionally 
tables are published with varying class-intervals, so that 
on one section of the scale the number of items falling 
within a class having an interval of 5 is given, and on 
another section of the scale the number of items falling 
within a class having a range of 10 is given. Obviously, 
comparison of classes is impossible. It may be desirable 
to show in more detail the cases falling within certain 
ranges on the scale, but this end is best achieved by the 
construction of a supplementary table relating only to the 
cases falling within this restricted section. The utility of 
the main table is not lessened thereby. 

Similar in nature is the requirement that there should be 
no indeterminate classes, that is, classes the ranges of which 
are not defined. Had all the individuals making $30 . 00 and 
over in the illustration of piece-work earnings been entered 
in a class with the designation “$30.00 and over,” the 
upper limit of this class would have been quite uncertain. 
This fault in a table is a vital one when it is desired to base 
calculations upon the data contained in the table. When 
there are several extreme cases the inclusion of such classes 
is sometimes unavoidable, but when this is done the actual 
values of the cases included in such “open end” classes 
should be given in a footnote to the table. 

The errors described in the two preceding paragraphs 
are exemplified in the table on page 62. 

In this case the ranges of the two “open end” classes are 
not known. The ranges of the intermediate classes vary, 
being $5.00 for two classes, $10.00 for one class, $20.00 
for one class and $25.00 for two classes. The purposes of a 
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Table 10 

Frequency Distribution of Rented Dwellings in Reno , Nevada, 1934 
(Classified on the basis of rental value 


Monthly rental 

Under $10.00 
$10.00 to $14.99 

15.00 to 19.99 

20.00 to 29.99 

30.00 to 49.99 

50.00 to 74.99 

75.00 to 99.99 
$100.00 and over 


Number of dwellings 
in each class 
( frequency ) 

327 

349 

521 

1,039 

1,075 

189 

24 

9 

3,533 


special investigation may sometimes be served by the use 
of such a form, but a table of this type is poorly adapted to 
the requirements of statistical calculation. 

The Steuctuke of Statistical Tables 

The preceding discussion has been confined to certain 
more or less technical problems which arise in the construc- 
tion of a frequency table. Nothing has been said directly 
as to the form of the completed table, the arrangement of 
columns and rows, the title, the notation. No general prin- 
ciples of tabular arrangement have been laid down. While 
no detailed treatment of these principles is possible within 
the scope of the present discussion, certain general consid- 
erations relating to the structure of statistical tables may be 
suggested. 

The statistical table is merely a device for presenting in 
s ummar y fashion a mass of quantitative data. Unless the 
s ummar y be clear, significant, concise, and readily inter- 
preted nothing has been gained by the process of tabulation 

1 The table is taken from Real Property Inventory , 1934. Summary and Sixty- 
Four Cities Combined , Department of Commerce, Washington. Figures for 
255 rented dwellings in Reno were not reported. 
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and classification. A sprawling, formless table is like a 
rambling, unintelligible discourse. There must be a purpose 
in back of each table, and this purpose should be clearly 
brought out in its arrangement. The means by which this 
purpose may be attained in a given case must be deter- 
mined with reference to the specific conditions affecting that 
case, but standard practices should be followed, in so far 
as possible. The following general principles will be found 
helpful in deciding upon the form and arrangement of sta- 
tistical tables: 

1. The title should constitute a clear, concise and complete 

description of the material assembled in the table. 

2. Headings of columns and rows should be concise and unambigu- 

ous. 

3. Variable quantities should increase from left to right and 

from top to bottom, when such arrangement is feasible. 

4. Columns and rows may be numbered to facilitate reference to 

the tabic. 

5. The units of measurement employed should be clearly indicated. 

6. Sources should be given in all cases. 

7. The table should constitute a unit, self-sufficient and self- 

explanatory. All explanations necessary for its interpre- 
tation should be included as integral parts of the table, or in 
the form of footnotes. 


Graphic Representation of Frequency Distributions 

Frequency distributions of the type illustrated above serve 
a very important statistical function in presenting a compact 
summary of data, and in preparing these data for further 
manipulation. Such distributions may be presented not 
only in tabular form, but graphically, utilizing the general 
principles of the coordinate system which were explained 
above. Many of the characteristic features of a frequency 
distribution are most clearly revealed when the graphic 
method is adopted. 

Table 6, presenting the weekly earnings of 210 employees, 
with a class-interval of two dollars, is depicted graphically 
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in Fig. 23. In this figure class-intervals are plotted along 
the z-axis and the corresponding class-frequencies along the 
y-sixis, appropriate scales being selected. The fact should 
be noted that the scale of abscissas starts not with zero, 
but with $20. For convenience in presentation that part 



Dollars 

Fig. 23. — Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $2.00) 

of the scale extending from 0 to $20 is omitted. The student 
should bear this in mind in seeking to secure a correct 
impression of the relations between the two variables plotted. 
In constructing such a figure, which is termed a column 
diagram or histogram, short horizontal lines are drawn con- 
necting the points plotted to represent the upper and lower 
limits of each class-interval. In interpreting this diagram 
it should be noted that the areas of the different rectangles 
are proportional to the number of cases represented, the 
total area representing the entire 210 cases. This device 
thus presents to the eye a very clear picture of the distribu- 
tion, showing quite unmistakably the relative number of 
workers falling in each of the wage classes. 

The classes in this case are so large, however, that some 
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violence is done to the facts. So many details are lost 
that a true conception of the disposition of the items is not 
given. Fig. 24 is a histogram depicting the distribution of 
cases when a class-interval of one dollar is used. In this 
case, with smaller steps, we approach more closely an orderly 
and symmetrical distribution. The same is true of Fig. 25 



on the Basis of Weekly Earnings (Class-interval = $1.00) 


which shows the distribution when the class-interval is 
fifty cents. The distribution represented in Fig. 26 has a 
class-interval of twenty-five cents which, as has been pointed 
out, is too narrow for the data, with the result that a quite 
irregular structure is secured. (It should be noted that 
the vertical scale is not the same in these four figures, so 
that comparison with respect to class-frequencies is only 
possible bv reference to the scale figures.) 




Fig. 26. — Column Diagram: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval =•• $.25) 
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Fig. 27. — Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $2.00) 



Fig. 28. — Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $1.00) 

67 
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Frequency polygons corresponding to the histograms of 
Figs. 23, 24, and 25 are shown in Figs. 27, 28, and 29. 
Each of these polygons has been constructed by plotting as 
abscissas the mid-points of the class-intervals, and as ordi- 
nates the class-frequencies, the points thus secured being 



Fio. 29. — Frequency Polygon: Distribution of 210 Employees Classified 
on the Basis of Weekly Earnings (Class-interval = $.50) 

connected by a broken line. In completing such a figure 
the class next below the lowest one on the scale and the 
class next above the highest one on the scale are included, 
the class-frequency being zero in each case. The ends of 
the polygon thus connect with the base line at the mid- 
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points of these two extra classes. In the case of the fre- 
quency polygon the entire area under the curve represents 
the entire number of cases, but the area of a given interval 
cannot be taken to be proportional to the number of cases 
in that interval, because of irregularities in the distribution 
on either side of the given class. The heights of the ordi- 
nates at the mid-points of the various classes are, of course, 
scaled to represent the class-frequencies. 

The Smoothing of Curves 

Attention is again called to the results secured with 
varying class-intervals. As the class-interval is decreased, 
up to a certain point, the histograms and polygons become 
smoother and more regular. Beyond that point breaks 
begin to appear in the data; the regular change in class- 
frequencies which was found when the classes were larger 
is broken by the appearance of irregular classes which 
seem to depart from the general rule. In Fig. 25 these 
have become quite pronounced. Such irregularities, it is 
obvious, are exceptions to a general rule which seems to 
prevail, the general rule that the numbers of workers falling 
within the different wage classes increase from the lower 
limit of earnings up to a maximum in the neighborhood of 
$27.50, and then decrease till, at the upper limit of $32, 
but one worker is found. Since all the 210 individuals are 
engaged in the same work, and since their earnings depend 
only upon their rapidity and skill, one would expect a quite 
regular increase and decrease. If we had figures not for 
one week only, but for 52 weeks, and took the average 
weekly earnings of each of the 210 workers for the year, 
we should expect greater regularity with the smaller class- 
intervals than is actually found, since the accidental fluc- 
tuations peculiar to one week alone would thus be elimi- 
nated. Or, if we had earnings during one week for 10,920 
workers (52 times 210), the same result would be secured. 
Thus, if regularity and smoothness are to be secured, it is 
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essential not only to decrease the size of the classes but 
also to increase the number of cases, in order that the 
accidental irregularities which affect a small number of 
observations may be eliminated. A refined classification 
with a small number of cases leads to the condition exempli- 
fied in Fig. 26. But such an increase in the number of cases 
is, in general, a practical impossibility. We wish, if pos- 
sible, to develop a feasible method of approximating the 
distribution which would be seemed with very small class- 
intervals and a very large number of cases. Such an approxi- 
mation is possible through the device of curve-smoothing. 
By this method we may secure a smooth frequency curve 
which lacks the irregularities occasioned by minor fluctu- 
ations. 

Such a smooth frequency curve serves to represent the 
true underlying distribution of the data. It was pointed 
out that areas in the frequency polygon are not propor- 
tional to the number of cases included, the cause lying in 
the irregularities of the data. In a smoothed frequency 
curve these irregularities have been eliminated, and the 
area between ordinates erected at given points on the scale 
of abscissas is assumed to be proportional to the theoretical 
frequency of cases between the given values. Moreover, a 
smooth trend having been established, frequencies for in- 
termediate values not shown in the original table may be 
determined by interpolation. 1 

The following data, 2 representing the distribution in 1918 
of personal incomes below $4,000, will serve to exemplify 
the smoothing process. 


1 The limitations of practical statistical work are such that there must of 
necessity be many gaps in the data. The given values of the variables are not 
continuous. Interpolation is the process of estimating values of a variable 
quantity between given values, or of locating a point on a curve between given 
points. That interpolation is most accurate which leads to estimated values 
having the highest degree of consistency with the given values. 

2 From Vol. I, Income in the United Stales , National Bureau of Economic 
Research. New York, Harcourt, Brace & Co., 1921, 132-33. 
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Table 11 

Distribution of Income among Personal Income Recipients in 1918 
(Including all personal incomes below $4,000) 


Income class 1 

Number of pert 

$ 0 to $ 100 

62,809 

100 to 200 

103,704 

200 to 300 

209,087 

300 to 400 

489,963 

400 to 500 

961,991 

500 to 600 

1,549,974 

600 to 700 

2,154,474 

700 to 800 

2,668,466 

800 to 900 

3,013,034 

900 to 1,000 

3,144,722 

1,000 to 1,100 

3,074,351 

1,100 to 1,200 

2,850,526 

1,200 to 1,300 

2,535,285 

1,300 to 1,400 

2,205,728 

1,400 to 1,500 

1,832,230 

1,500 to 1,600 

1,512,649 

1,600 to 1,700 

1,234,397 

1,700 to 1,800 

999,996 

1,800 to 1,900 

811,236 

1,900 to 2,000 

663,789 

2,000 to 2,100 

549,787 

2,100 to 2,200 

463,222 

2,200 to 2,300 

395,115 

2,300 to 2,400 

340,141 

2,400 to 2,500 

295,490 

2,500 to 2,600 

258,650 

2,600 to 2,700 

227,731 

2,700 to 2,800 

201,488 

2,800 to 2,900 

178,901 

2,900 to 3,000 

154,499 

3,000 to 3,100 

142,802 

3,100 to 3,200 

128,217 

3,200 to 3,300 

115,583 

3,300 to 3,400 

104,504 


1 The definition of classes used is equivalent to “$0 to and not including 
$100,” etc. Thus an individual with an income of $100 would fall in the second 
class. 

* The National Bureau’s report states “The numbers below are given to the 
nearest unit. It is not pretended that such arithmetic accuracy is anything 
more than technical.” 
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Table 11 — Continued 


Distribution of Income among Personal Income Recipients in 1918 


Income class 
$3,400 to 3,500 
3,500 to 3,600 
3,600 to 3,700 
3,700 to 3,800 
3,800 to 3.900 
3,900 to 4,000 


Number of persons 
94,803 
86,405 
79,023 
72,562 
66,900 
61,894 


Figures 30, 31, and 32 present column diagrams of these 
income data, grouped with class-intervals of $500, $200, and 
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Fig. 30. — Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $500) 



$100. As the class-interval is decreased the histograms be- 
come more regular and uniform, but our original data 
permit us to carry this process only to the point where the 
class-interval is $100. Our problem is to determine the 
underlying distribution which the data approximate more 
and more closely as the class-interval is lessened. If we 
replace the broken line of the histogram by a smooth curve 
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1 , 000,000 

Fig. 31. — Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Class-interval = $200) 
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Fig. 32. — Column Diagram: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
14,000 (Class-interval = $100) 
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enclosing the same total area as the histogram and so drawn 
through the points of the histogram that the area cut from 
each rectangle is approximately equal to the area added to 
the same rectangle by the curve, we will have a frequency 



Dollars 

Fia. 33. — Frequency Curve: Distribution of Personal Income Recipients 
in the United States, 1918. Including all Recipients of Incomes below 
$4,000 (Derived from the column diagram with class-interval of $100) 

curve representing the desired distribution. The require- 
ment that the same total area be enclosed is fundamental. 
Exceptions to the rule concerning the area of individual 
rectangles will frequently occur because of the existence of 
quite irregular classes, but as a general working principle 
it is helpful. (More refined methods of fitting a smooth 
curve to data will be discussed at a later point, but a process 
of smoothing by inspection such as that described above 
gives a fairly close approximation to the required curve.) 

Figure 33 illustrates the result of smoothing the histo- 
gram of income distribution shown in Fig. 32. Here the 
quite artificial jumps between income classes are smoothed 
out, and we secure the graduation by infinitesimal incre- 
ments which we should expect to find when the incomes of 
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so many millions of persons are included. Here we have 
that which we desired — an approximation to the true un- 
derlying distribution, with the sharp breaks resulting from 
the method of classification eliminated. 

CONTINUOUS AND DISCRETE SERIES 

The logical validity of the smoothing process is dependent 
upon the nature of the data being manipulated. From 
this point of Anew frequency series of the type discussed 
above may be divided into two classes, continuous series 
and non-continuous or discrete series. A continuous series is 
one in which the values of the independent variable in- 
crease or decrease by increments which are infinitely small. 
A discrete series is one in which the phenomena represented 
by the independent variable always change in value by 
definite amounts. The curve of underlying values rises not 
smoothly, as for the continuous series, but by jumps. 

The fact should be emphasized that in making this dis- 
tinction we are speaking of the values as they would be found 
in the underlying universe of phenomena from which the ac- 
tual bodies of material we study are drawn. Any given 
sample, whether representing continuous or discrete series, 
will be marked by breaks in the values of the independent 
variable. This will be true, in the case of a continuous 
series, because of the limitations upon the instruments and 
senses we use in measuring. Thus if the heights of 100 men 
be measured, the independent variable of the frequency 
'series (height) will increase by finite amounts. We may 
measure to the nearest inch, or perhaps to the nearest 
eighth of an inch. Yet if ten thousand or ten million men 
were arranged in order of height the differences between 
successive individuals would be infinitely small. Height is a 
continuous variable, even though the values found in a given 
sample are marked by discontinuity. 

Quite different is the distribution of such a variable as 
interest or discount rates. If one were to secure 100 such 
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quotations and rank them in the order of size the varia- 
tions would be discontinuous, as in the sample of men 
whose heights were measured. But in the case of heights 
the underlying values, if they could be determined for a 
large population, would be marked by continuous varia- 
tion, whereas, were an infinite number of discount rate 
quotations secured, there would still be breaks in the se- 
quence. Discount rates increase or decrease by one quarter 
or one half of one per cent, not by infinitesimal amounts. 
Such a series is termed discrete, or non-continuous. 

The smoothing process provides a means of securing an 
approximation to the distribution of values as they would 
be found if a sample could be increased indefinitely in size. 
It is based upon the assumption that the irregularities 
found in the sample actually studied are accidental, and 
that the underlying values would show continuous and un- 
broken variation. Obviously, therefore, it is only fully 
justified when applied to a continuous series. A histogram 
of human heights may be smoothed in order to secure a 
representation of the true underlying distribution in the pop- 
ulation at large, and interpolation based upon this smoothing 
process is valid. But smoothing is quite illogical for a 
markedly discontinuous series. It would be meaningless to 
construct a smooth curve showing the distribution of dis- 
count rates for the purpose of securing the theoretical fre- 
quency of a rate of 4 . 3675 per cent. In practical statistical 
work, however, it is frequently helpful to handle discrete 
series as though they were continuous, and in these cases 
the smoothing device may be employed. But in the inter- 
pretation and use of the smoothed curve the important 
logical distinction between continuous and discontinuous 
variation should be kept clearly in mind. 

Cumulative Arrangement of Statistical Data 

For certain purposes it is desirable to arrange data cumu- 
latively, rather than in separate and exclusive classes of 
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the type illustrated in the frequency tables presented above. 
The following material will illustrate some of the advantages 
of this arrangement. 

In a study of the durability of telephone poles 1 these 
results were secured: 


Table 12 

Frequency Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 


Length of life 
(years) 

0 - 0.9 

1 - 1.9 

2 - 2.9 

3 - 3.9 

4 - 4.9 
6 - 5.9 

6 - 6.9 

7 - 7.9 

8 - 8.9 

9 - 9.9 

10 - 10.9 

11 - 11.9 

12 - 12.9 

13 - 13.9 

14 - 14.9 

15 - 15.9 

16 - 16.9 

17 - 17.9 

18 - 18.9 

19 - 19.9 

20 - 20.9 

21 - 21.9 

22 - 22.9 

23 - 23.9 


Number of poles 
(frequency) 

1,150 

4,221 

10,692 

13,966 

16,633 

18,211 

19,011 

19,260 

20,909 

19,879 

20,764 

15,454 

14,237 

13,779 

9,764 

8,534 

7,659 

6,918 

4,591 

1,798 

815 

313 

102 

47 


The table shows that 1,150 poles were scrapped during 
the first year of use, that 4,221 were scrapped after reaching 
the age of one year and before reaching the age of two 

1 11 Replacement Insurance,” Edwin Kurtz. Administration, July, 1921, 
41-69. 
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years, and so on. This is simply a frequency table of the 
ordinary type. A much more significant arrangement for 
many purposes is secured when the figures are assembled 
cumulatively, as in the following table. 

Table 13 

Cumulative Distribution of 248,707 Telephone Poles , Classified 
According to Length of Life 


(Cumulated upward with reference to life scale) 


Length of life 

Number of poles surviving 
{frequency) 

Less than 

1 year 

1,150 

tt 

it 

2 years 

5,371 

tt 

tt 

3 “ 

16,063 

it 

tt 

4 “ 

30,029 

ll 

it 

5 “ 

46,662 

it 

tt 

6 “ 

64,873 

tt 

tt 

7 “ 

83,884 

u 

tt 

8 “ 

103,144 

U 

tt 

9 “ 

124,053 

tt 

tt 

10 “ 

143,932 

(t 

it 

11 “ 

164,696 

tt 

tt 

12 “ 

180,150 

tt 

tt 

13 “ 

194,387 

tt 

tt 

14 “ 

208,166 

it 

tt 

15 “ 

217,930 

tt 

tt 

16 “ 

226,464 

tt 

tt 

17 “ 

234,123 

tt 

tt 

18 “ 

241,041 

tt 

it 

19 “ 

245,632 

tt 

tt 

20 “ 

247,430 

tt 

tt 

21 41 

248,245 

tt 

a 

22 “ 

248,558 

tt 

tt 

23 “ 

248,660 

tt 

tt 

24 “ 

248,707 


It is important to note that it is possible to cumulate a 
frequency series in two different ways. From the above 
table we may determine readily the number failing to at- 
tain any given age. It is often more convenient to reverse 
the process, so that the table will enable the total number 
above any given value to be immediately determined. When 
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the telephone pole figures are thus cumulated downward 
the following table is secured. 


Table 14 


Cumulative Distribution of 248,707 Telephone Poles, Classified 
According to Length of Life 
(Cumulated downward with reference to life scale) 


(1) 

Length of life 

(2) 

Number of poles surviving 
(frequency) 

(3) 

Per cent 

0 and more 

248,707 

100.0 

1 year 

it 

tt 

247,557 

99.5 

2 years 

it 

tt 

243,336 

97.8 

3 “ 

ft 

tt 

232,644 

93.6 

4 “ 

tt 

tt 

218,678 

88.0 

5 “ 

if 

tt 

202,045 

81.2 

6 “ 

ft 

tt 

183,834 

73.8 

7 “ 

ft 

tt 

164,823 

66.3 

8 “ 

it 

tt 

145,563 

124,654 

58.5 

9 “ 

it 

tt 

50.1 

10 “ 

tt 

tt 

104,775 

42.1 

11 “ 

tt 

it 

84,011 

33.8 

12 “ 

it 

tt 

68,557 

27.6 

13 “ 

it 

tt 

54,320 

21.8 

14 “ 

tt 

tt 

40,541 

16.3 

15 “ 

tt 

tt 

30,777 

12.4 

16 “ 

tt 

tt 

22,243 

8.9 

17 “ 

tt 

tt 

14,584 

5.9 

18 “ 

tt 

ft 

7,666 

3.1 

19 “ 

tt 

tt 

3,075 

1.2 

20 “ 

tt 

tt 

1,277 

0.5 

21 “ 

tt 

tt 

462 

0.2 

22 “ 

tt 

it 

149 

0.06 

23 “ 

it 

tt 

47 

0.02 

24 “ 

tt 

it 

0 

0.00 


Cumulative tables such as those given above have dis- 
tinct advantages in the handling of many types of data. Life 
tables are generally presented in this form. The scientific 
study of depreciation will lead to the construction of elab- 
orate “mortality tables” for various types of equipment, 
and these will be most useful in the cumulative form. It 
is frequently desirable to reduce the frequencies to per- 
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centages, as in column (3) of Table 14, though it should 
not be forgotten that the significance of the percentages 
depends upon the absolute numbers upon which they are 
based. 

THE OGIVE, OB CUMULATIVE FREQUENCY CURVE 

The general utility of such cumulated data is limited by 
the classification system necessarily adopted in condensing 

Number 
of Poles 



Fig. 34. — Cumulative Frequency Curve: Distribution of Telephone Poles 
Classified according to Length of Life (Cumulated upward) 

the material. Unless we interpolate mathematically we are 
limited to the points on the scale actually noted in the two 
tables. For this reason, a generalized cumulative curve 
similar to the smoothed frequency curve described in the 
preceding section is desirable. If the values given in Table 13 
be plotted on coordinate paper (the length of life in each 
case as abscissa, and the corresponding number of poles as 
ordinate) and a smooth curve drawn through the points 
thus plotted, the cumulative frequency curve shown in 




THE OGIVE 81 

Fig. 34 is secured. In Fig. 35 the data of Table 14 are 
plotted. 

Such a curve constitutes one of the most effective and 
useful representations of a frequency series. It is obvious 
that the limitations of the particular class-interval adopted 
are in large part removed; the shape of the curve will be 
fundamentally the same, though the class-interval and num- 

Number 
of Poles 



Fig. 35. — Cumulative Frequency Curve: Distribution of Telephone Poles 
Classified according to Length of Life (Cumulated downward) 

ber of classes may vary. Frequency curves of the usual 
type may not be compared unless the groupings are the 
same, but cumulative frequency curves are subject to no 
such restriction. Moreover, uneven class-intervals do not 
distort the ogive, or cumulative curve, as they do the ordinary 
frequency curve. 

The cumulative curve is particularly well adapted to 
interpolation. Thus if it is desired to know the number of 
poles surviving less than 15$ years, the value of the ordi- 
nate of the curve having 15$ as abscissa may be approxi- 
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mated from Fig. 34. A value of 222,000 is secured. If the 
number surviving 8£ years or more is desired, a similar 
estimate may be made from Fig. 35. The interpolated 
figure in this case is 135,000. 

Another type of interpolation possible with such a curve 
is the determination of the number of cases falling within 
any given interval. One is not limited to the class-intervals 
marked out in the original tables. For instance, it may be 
desirable to know the number of poles surviving more than 
10$ but less than 15 years. Reading from the table or from 
the chart we find that 217,930 poles survived less than 
15 years. Interpolating on the chart in the manner de- 
scribed above a figure of 154,000 is secured for the number 
surviving less than 10$ years. Subtracting the latter figure 
from the former we have 63,930 as the number of poles 
falling within the 10 J to 15 years interval. The figure is, 
of course, an approximation to the true value, as are all 
values secured through such smoothing and interpolation. 

It should be noted that the ogive may be derived directly 
from the array, without the formation of a frequency table 
as an intermediate step. This curve, in fact, may be looked 
upon as merely a graphic representation of the array. It 
represents one of the simplest forms of statistical organi- 
zation, as well as one of the most effective methods of 
manipulating quantitative data. 

RELATION BETWEEN THE OGIVE AND THE FREQUENCY 

CURVE 

The ogive and the frequency curve are merely two dif- 
ferent arrangements of precisely the same material, each 
arrangement having certain distinctive advantages. The 
characteristics of each may be more clearly apparent if the 
structural relationship between these two curves is under- 
stood. This relationship is graphically portrayed in Fig. 36. 1 

1 The suggestive arrangement shown in this figure was originated by Robert 
£. Chaddock. 
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Labor Cost per 1000 Feet 


Fig 36. — Distribution of Sawmills in the United States Classified ac- 
cording to Labor Cost in 1921. Illustrating the Structural Relation 
between the Ogive and the Frequency Curve. 
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This figure is based upon the following frequency table, 
showing the distribution of sawmills in the United States, 
classified on the basis of labor cost per 1,000 feet of lumber 
produced. 1 


Table 15 

Frequency Distribution of 269 Sawmills in the United States Classified 


According to Labor Cost in 1921 

Labor cost ( all employees) per 

Number of establishments 

1,000 feet , board measure 

(frequency) 

S1.00-S1.49 

3 

1.50- 

1.99 

10 

2.00- 

2.49 

14 

2.50- 

2.99 

22 

3.00- 

3.49 

38 

3.50- 

3.99 

40 

4.00- 

4.49 

38 

4.50- 

4.99 

33 

5.00- 

5.49 

20 

5.50- 

5.99 

11 

6.00- 

6.49 

10 

6.50- 

6.99 

11 

7.00- 

7.49 

8 

7.50- 

7.99 

4 

8.00- 

8.49 

4 

8.50- 

8.99 

3 


2C9 


The upper part of Fig. 35 indicates the method by which 
the ogive is built up. Just as in the histogram, the area of 
each rectangle is proportional to the number of cases falling 
in the given class. Since the operation is a cumulative 
one, however, the base of each rectangle is the cumulated 
frequencies of all preceding classes. Thus the ?/-value (fre- 
quency) of the first rectangle is 3, erected from zero as a 
base, the y-value of the second class is 10, erected from 3 
as a base, and so on. The slope of the curve connecting 
these rectangles is gradual at first when the frequencies 

1 From “Labor Efficiency and Productiveness in Sawmills,” Ethelbert 
Stewart, Monthly Labor Review , January, 1923, 14. Seven scattered cases 
above $9 .00 in value have been omitted from the table and the accompanying 

graph. 
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are low, then steeper as the frequencies become greater, 
and finally tapers off as the frequencies decrease near the 
upper limit of the distribution. This is the cumulative 
frequency curve, or ogive. 

When the various rectangles representing the class-fre- 
quencies are dropped to the zero line as a common base, 
the x-values remaining the same throughout, the histogram 
or column diagram described in an earlier section is secured. 
From this the frequency polygon or smoothed frequency 
curve may be derived. 
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•CHAPTER IV 


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION: AVERAGES 

The classification of quantitative data and the construc- 
tion of a frequency distribution constitute an important 
stage in the task of organization and analysis. By means 
of classification the underlying structure of the data may 
be revealed and the essential unity of a mass of material 
may be brought out. But this is only the first step in statis- 
tical analysis. It remains to develop methods of measuring 
and expressing more concisely the significant characteristics 
of a body of data. For certain purposes the frequency dis- 
tribution itself must be summarized and condensed, must be 
boiled down until its essence has been distilled into three or 
four significant figures. 

If each frequency distribution constituted a novel and 
unique problem, obeying a law peculiar to itself, the task 
of studying and describing such distributions would be a 
difficult one. Fortunately this is not so. Quantitative 
data in widely different fields, when assembled in frequency 
distributions, show certain common characteristics, obey 
certain general laws. Experience in one field, therefore, 
constitutes a guide to work in others. Uniformity in the 
behavior of masses of data makes possible the development of 
a generalized method of organizing, analyzing and comparing 
measurements drawn from many fields of scientific study. 

Comparison of Frequency Distributions 

This fact of a common law of arrangement running through 
the universe of quantitative facts may be brought home 
most effectively by a comparison of distributions illustrative 

86 
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of various types of data. The characteristics of the frequency 
distributions and of the frequency curves which follow should 
be noted, and the distributions compared. 



according to Height 

The curve in Fig. 37 is based upon the following data 
relating to the heights of 18,780 soldiers. 1 

Table 16 


Distribution of Soldiers Classified According to Height 


Height in inches 

Number of soldiers 

Height in inches 

Number of soldietn 

60 + 

197 

67 + 

3,017 

61 + 

317 

68 + 

2,287 

62 + 

692 

69 + 

1,599 

63 + 

1,289 

70 + 

878 

64 + 

1,961 

71 + 

520 

65 + 

2,613 

72 + 

262 

66 + 

2,974 

73 + 

174 


Total 18,780 


1 From G. C. Whipple, Vital Statistics , New York, Wiley, 1919, 377. 
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Fig. 38 depicts a frequency curve based upon 1,000 ob- 
servations, made at Greenwich, of the Right Ascension of 
Polaris. 1 The values on the abscissa define deviations, in 
seconds of time, from an origin near the mean of all the 
observations. Frequencies of occurrence of given values on 
the x-scale are measured, of course, as ordinates on the 



-3.5-3.0-2.5-2.0-l. 5-1.0 -5 0 +.5+1.0+1.5+2.0+2.5+3.0+3.5 

Magnitude of Deviation in Seconds of Time 

Fig. 38- — Frequency Curve: Distribution of Errors of Observation in 
Astronomical Measurements 

y-scale. The distribution plotted in Fig. 38 is given in 
Table 17 on page 89. 

If a piece of artillery be accurately adjusted on a given 
target (a point) and 100 shots be fired, it will be found 
that the points of impact of the hundred shots will be dis- 
persed about the target. No matter how accurate the piece 
or the adjustment only a small percentage of the shots 
will fall upon the exact point at which they were directed. 
The points of impact will be scattered about the target 
in a quite regular fashion, however. If a rectangle be so 
drawn as to include all the points of impact, and this rec- 

1 From E. T. Whittaker and G. Robinson, The Calculus of Observations' 
London, Blackie and Son, 1924, 174. 
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Table 17 

Distribution of Errors of Observation in Astronomical Measurements 


(1000 observations of the Right Ascension of Polaris) 


Magnitude of deviation , 
in seconds of time , from origin 

Number of observations 

- 3.5 

2 

-3.0 

12 

- 2.5 

25 

— 2.0 

43 

- 1.5 

74 

- 1.0 

126 

- 0.5 

150 

0 

168 

0.5 

148 

1.0 

129 

1.5 

78 

2.0 

33 

2.5 

10 

3.0 

2 


1,000 


tangle (or zone of dispersion ) be divided into eight equal 
parts, the distribution of shots within these sections will be 
as indicated in Fig. 39. (In any given case there are likely 


2 

7 

16 

25 

25 

16 

7 

2 


Fig. 39. — Zone of Dispersion, Artillery Firing, Showing the Theoretical 
Percentage Distribution of Shots 


to be slight departures from this order, but in the long run 
this distribution will prevail.) 

This general rule holds for all classes of guns. The more 
accurate the gun the smaller will be the zone of dispersion, 
but the distribution within this zone is theoretically the 
same in all cases. Rules of fire used in artillery adjustment 
are based upon this fact. 
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The results of actual firing may be contrasted with this 
theoretical distribution. Table 18 presents a record of one 
thousand shots fired from a battery gun at the middle of a 
stationary target two hundred yards distant. 1 The target 
was divided by horizontal lines into eleven equal divisions. 

Table 18 

Thousand Shots from a Single Gun 
Number of shots recorded 
1 
4 
10 
89 
190 
212 
204 
193 
79 
16 
2 

1,000 

These results are presented graphically in Fig. 40. 

The zone of dispersion being divided into eleven divisions 
instead of the eight referred to in describing the theoretical 
distribution, a direct comparison cannot be made. We 
have here, however, the same general type of distribution 
found in the other examples given. A tendency toward 
concentration in the lower half of the target reflects a slight 
departure from symmetry. 

When coins are tossed the distribution of heads and tails 
is assumed to be determined by pure chance. In a single 
experiment ten coins were tossed 100 times. The following 
table shows the frequencies with which given numbers of 
heads appeared. (The greatest number of heads possible 

1 This experiment is recorded in the Report of the Chief of Ordnance, 1878, 
Appendix S. The results are given in The Method of Least Squares, Mansfield 
Merriman, New York, Wiley, 1897, 14. 


Distribution of One 

Division 

1 (top) 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 (bottom) 



Number of Shots 
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in a given throw under such conditions is, of course, 10; 
it is also possible that no heads should appear.) 



Divisions 


Fig. 40. — Column Diagram: Distribution of 1,000 Shots from a Single 

Gun 


Table 19 


Distribution of Results in Coin Tossing Experiment 
(Ten coins tossed 100 times) 

Number of heads Frequency of occurrence 


10 

9 

8 

7 

6 

5 

4 

3 

2 

1 

0 


0 

1 

4 
7 

23 

30 

20 

9 

5 
1 
0 


100 
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Figure 41 depicts the above frequency distribution. 



Fio. 41. — Frequency Polygon: Distribution of Heads in a Coin Tossing 

Experiment 


Distribution of Economic Data 

We find in these four widely different fields something 
approaching a uniform law of arrangement of quantitative 
data. The examples which have been given, however, do 
not represent the world of economic facts. Do economic 
data show the same general characteristics? If reference 
be made to the examples given in Chapter III, comparisons 
with the four preceding illustrations may be made. The 
frequency distributions referred to are those relating to 
weekly earnings of employees, the length of life of tele- 
phone poles, the distribution of labor cost in sawmills and 
the distribution of incomes below $4,000 in the United 
States. (The curve of the latter distribution, it should be 
noted, would show a long tail extending far to the right if 
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the incomes above $4,000 were included.) Several additional 
examples of economic data may be given. 

Figure 42 illustrates the order in which price variations are 
distributed. It is based upon a study made by W. C. Mitchell 
of 5,578 individual cases of change in the wholesale prices 



Percentage of Fall Percentage of Rise 


Fig. 42. — Frequency Polygon: Distribution of 5,540 Cases of Change 
in the Wholesale Prices of Commodities from One Year to the Next 
(after Mitchell) 

of commodities from one year to the next. 1 Thus, for ex- 
ample, the average price of middling upland cotton in 
New York in a given year was $0,115 per pound. In the 
following year the average price was $0,128 per pound, an 
increase of 11.3 per cent. This would constitute one entry 
in the table of rising prices, falling in the class “ 10-11.9%.” 
The entire table consists of 5,578 such entries. These data 
are presented in Fig. 42 in the form of a frequency polygon, 
no attempt being made to smooth the curve. 

1 From Bulletin 284, U. S. Bureau of Labor Statistics, Part I, “The Making 
and Using of Index Numbers,” 18. The figure shows the price changes only 
within the range of a 51 per cent fall and a 51 per cent rise. One case of a 
price fall of 55 per cent is not shown, and 37 cases of price increases ranging 
from 52 per cent to 104 per cent have not been included. 
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Table 20 shows the distribution of London-New York 
exchange rates (sterling exchange) from 1882 to 1913, in- 
clusive. This was a period when both currencies were freely 
convertible into gold, at fixed ratios, with customary market 
forces operating to keep exchange rates between the two 
“gold points.” Observations covering recent decades would 
show quite different characteristics. In the distribution 
shown graphically in Fig. 43 monthly rates have been 



Fro. 43. — Frequency Polygon: Distribution of London-New York 
Exchange Rates (as recorded over a period of 384 months) 

classified according to the frequency of their occurrence over 
thirty-two years of pre-war experience. 1 

A fairly typical distribution of wage-earners classified 
according to the amount of their weekly earnings, is shown 
in Table 21 and, graphically, in Fig. 44. The data relate to 
13,427 steel workers in open-hearth furnaces, in the United 

1 “The figures are . . . the averages of those quoted at the beginning of each 
month in the Economist; on and after July, 1886, the exchange is the ‘tele- 
graphic transfer,' before that date, ‘short at interest.' ” The data are taken from 
An Academic Study of Some Money Market and Other Statistics , by E. G. 
Peake. London, P. S. King, 1923. Appendix I. 


Frequency 


Table 20 

Distribution of London-New York Exchange Rates as Recorded by 
Months during the Period 1882-1913 


Class-interval 


$4.8275-$4.8324 
4.8325- 4.8374 
4.8375- 4.8424 
4.8425- 4.8474 
4.8475- 4.8524 
4.8525- 4.8574 
4.8575- 4.8624 
4.8625- 4.8674 
4.8675- 4.8724 
4.8725- 4.8774 
4.8775- 4.8824 
4.8825- 4.8874 
4.8875- 4.8924 
4 . 8925- 4.8974 
4.8975- 4.9024 
4.9025- 4.9074 
4.9075- 4.9124 


Frequency 

( number of months given 
rate 'prevailed ) 

1 

6 

11 

21 

23 

24 

25 
40 
45 
49 
35 
45 
33 
16 

8 

1 

1 


384 


5.000 

4.000 

3.000 

2.000 

1,000 



8 16 24 32 40 48 56 64 72 80 88 96 104112120 
Class-Interval (in dollars per week) 


Pig. 44. — Distribution of Wage-Earners in Open-Hearth Furnaces* 
Classified according to Average Weekly Earnings in 1935 
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States, in 1935. There is a clear concentration of workers 
whose earnings fall between $16 and $24 a week. The dis- 
tribution is markedly skewed, however, with a tail extending 
far to the right. The range of weekly earnings, like that of 
incomes in general, is far greater above the mode than 
below. 


Table 21 

Distribution of Wage-Earners in Open-Hearth Furnaces in the United 
States, Classified According to Average Weekly Earnings in 1935 
(Total for all districts) 


Class-interval 

Frequency 

(in dollars 

(■ number of workers 

per week) 

earning stated amount) 

$ 0-3 7.99 

583 

8 - 15.99 

2,200 

16 - 23.99 

4,462 

24 - 31.99 

3,032 

32 - 39.99 

1,527 

40 - 47.99 

764 

48 - 55.99 

358 

56 - 63.99 

210 

64 - 71.99 

144 

72 - 79.99 

44 

80 - 87.99 

36 

88 - 95.99 

21 

96 - 103.99 

26 

104 - 111.99 

3 

112 - 119.99 

7 

120 - 127.99 

1 

128 - 135.99 

9 


13,427 


The frequency curves and histograms based upon eco- 
nomic data, it will be noted, do not all show the symmetry 
and regularity which seem to characterize the curves rep- 
resenting physical data. Some are non-symmetrical, showing 
a preponderance of cases on one side of the point of greatest 
concentration. In some there are breaks in the regularity of 
the increase or decrease of frequencies. But in spite of 
these differences there is obviously a family resemblance 
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between the measurements drawn from the fields of eco- 
nomics, astronomy, anthropometry, ballistics, and pure 
chance. Certain of the common characteristics may be noted. 

General Characteristics of Frequency Distributions 

There is, in the first place, variation in the values of the 
measurements secured. Human heights vary, astronomical 
measurements of the same quantity differ, projectiles fired 
under conditions as nearly constant as it is humanly possible 
to make them fail to land at the same spot, incomes vary 
as between individuals, and exchange rates move from week 
to week and month to month. The various observations or 
values secured in a given case are distributed along a scale, 
between two extreme values. 

The distribution of these values along the scale (the 
x-axis) is such that, moving from one extreme value to- 
wards the other, the cases found at successive points along 
the scale (the successive class frequencies) increase with 
more or less regularity up to a maximum, and then de- 
crease in much the same way. In spite of variation, there- 
fore, we find a central tendency, a massing of cases at certain 
points on the scale of values. This is the second notable 
characteristic which all the frequency distributions appear 
to possess in common. 

If we measure, for each of the successive classes, the 
amount of deviation along the scale from the point of 
greatest concentration it will be noted that small deviations 
are much more frequent than large ones, that extreme 
deviations are rare, and that deviations on both sides of 
the point of concentration reach perfect (or almost perfect) 
equality in the examples taken from the physical sciences 
and from the field of pure chance, and approximate equality 
in the economic distributions. (Exceptions to this rule 
of approximate equality on the two sides of the point of 
greatest concentration are not infrequent, the example of 
income distribution being a rather striking case in point.) 
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Figure 45 depicts a curve which is termed the “proba- 
bility curve,” or the “normal curve of error.” Its charac- 
teristics will be discussed in greater detail in a later section. 
At this point it is presented merely as a basic type which 
some of the above examples approach closely, and from 
which others of the examples represent more or less pro- 
nounced deviations. Departures from this type, let it be 
emphasized, are numerous and significant, but as a basic 



— 3tf — 2tf -ltf 0 +ltf + 2tf +3 tf 


Fig. 45. — The Normal Curve of Error 

form this normal curve of error is extremely important in 
statistical work. Even the most important variations from 
this type resemble it with sufficient closeness to justify the 
use of a generalized method of describing frequency distri- 
butions. Distributions of quantitative data vary, and their 
variations from each other and from certain standard types 
are of the greatest significance, but in spite of their varia- 
tions a family resemblance runs through them all. Each 
new frequency distribution is not an isolated phenomenon, 
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but a member of a large family, and as such the problem 
of describing and analyzing it may be approached with 
confidence in methods which have been found applicable in 
other cases. 

Given this more or less common type, how may a given dis- 
tribution be described and differentiated from others? Certain 
methods will have been suggested by the preceding discussion. 

Methods of Describing a Frequency Distribution 

The values of all the observations, it has been noted, are 
spread along a scale. The frequency distribution may be 
described by the selection of a single value on that scale 
which is thoroughly representative of the distribution as a 
whole. Since the frequencies vary, an obvious choice is 
the selection of that value which occurs the greatest number 
of times, or, in other words, that point on the scale at 
which the concentration is greatest. This value consti- 
tutes a measure of the central tendency of the distribution. 
Thus, one might find the income class in which the greatest 
number of people fall, and let the mid-point of that class 
(which is $950 in the distribution presented in Table 11) 
serve as the representative of the distribution. This most 
common value, it should be noted, is only one of several 
possible measures of the central tendency of a given dis- 
tribution. All such measures are termed averages. 

A single representative value such as this has many uses 
but, by itself, it obviously leaves out many facts concern- 
ing the distribution. Of great importance is the character 
of the distribution about the average. Are the values of all 
tabulated cases closely concentrated, or is there pronounced 
dispersion over a wide range? The representative character 
of any average depends upon how closely the other values 
cling to it, upon the degree of concentration about the 
central tendency. The average, therefore, must be supple- 
mented by a measure of variation, a measure of the “scatter" 
about the central value. 
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An adequate description should include also an account 
of the degree of symmetry of the distribution. It is highly 
important to know whether there is an equal distribution 
of cases on each side of the point of greatest concentration, 
or whether the frequency curve is skewed to one side, as in 
the case of income distribution illustrated above. If the 
curve is not symmetrical the degree of asymmetry should 
be determined, and for this purpose measures of skewness 
have been developed. 

It is, finally, possible to measure the degree of peaked- 
ness of frequency curves, by comparing them with the 
normal curve of error as a standard. It is obvious that 
the frequency polygon representing price changes (Fig. 41) 
would, if smoothed, constitute a curve much more peaked 
than the normal curve, and this fact of pronounced con- 
centration at the central value is highly significant. This 
characteristic of frequency curves is called kurtosis, and 
the measurement of kurtosis constitutes the final step in 
the description of the frequency distribution. 

When these various measures have been secured the task 
of statistical analysis will be well under way. The chaotic 
assortment of data with which we started will have been 
reduced to workable form in the shape of a frequency table, 
and the essential facts which the table reveals will have 
been distilled into three or four significant measures. This 
process not only reveals the characteristics of the given 
distribution, but also facilitates comparison with similar 
distributions. For example, it is impossible to compare 
some tens of millions of unorganized personal income figures 
for the United States with similar data for Great Britain. 
But if we secure a value for the average or most repre- 
sentative income for each country, together with a descrip- 
tion of the distribution of personal incomes about that 
central value, a legitimate basis for comparative study is 
obtained. In manipulating and analyzing masses of ma- 
terial, whatever the purpose of study may be, full use 
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should be made of the power to condense, simplify and 
compare which is given by the measures employed in de- 
scribing the frequency distribution. 

The succeeding section is devoted to a discussion of one 
phase of this descriptive process, that concerned with the 
measurement of central tendencies. After the development 
of this subject of averages, problems relating to measures 
of variation and of skewness will be dealt with. 

Averages 

We have seen that the representation of a frequency 
distribution by an average, a single typical figure, is justi- 
fied because of the tendency of large masses of figures to 
cluster about a central value, from which the values of all 
observed cases depart with more or less regularity and 
smoothness. It is solely because of the concentration of 
cases about a central point on the scale that such repre- 
sentative figures have significance. The average represents 
the distribution as a whole only because it is a typical 
value. If the individual items entering into a distribution 
vary widely in value and show no tendency toward con- 
centration, no single value can represent them. Thus the 
arithmetic mean of the three numbers 3, 125, 1,000 is 376, 
but 376 in no way represents the three values on which it 
is based. This fundamental requirement, that there be a 
tendency toward concentration about a central value, must 
be met if an average is to be at all representative. 

If the general character of a frequency distribution be 
recalled the logic of one sort of average will be clear at 
once. It was suggested above that that point on the x-scale 
at which the concentration is greatest, that value which 
occurs the greatest number of times, might be taken as 
typical of the entire distribution. This value is termed 
the mode, and the group in which it falls is called the modal 
group. If a frequency curve be drawn to represent a given 
distribution, the mode will be the x-value corresponding to 
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the maximum ordinate. 1 The maximum ordinate itself meas- 
ures the frequency of the modal group. Students frequently 
confuse these two values in determining the mode. It is 
not the distance along the y - scale but the distance along 
the x-scale which measures the value of the mode. The 
ordinates merely measure the number of cases falling in the 
several classes, not the values of the cases falling in those 
classes. 

As typical of a given distribution we might also select 
that point on the scale of x-values on each side of which 
one half the total number of cases fall. This value, which 
is called the median, is that which exceeds the values of 
one half the cases included, and is in turn exceeded by the 
values of one half the cases. Thus it has been estimated 
that in 1918 the median value of personal incomes in the 
United States was $1,140; one half of the 37 million recipi- 
ents of personal incomes received less than this sum, while 
one half received more. When a distribution is represented 
by a frequency curve, the area under the curve is divided 
into two equal parts by an ordinate erected at that point 
on the x-axis corresponding to the median value. This 
follows, of course, from the definition of the median, and 
from the fact that the area under a frequency curve repre- 
sents the total number of cases included in the distribution. 

The arithmetic mean is a third type of average which 
may be used to represent a distribution. This is a calcu- 
lated average, affected by the value of every item in the 
distribution. Herein, obviously, it differs from the mode 
and the median, which depend primarily upon the relative 
position of the items in the frequency table, and are not 
affected by the values of all individual items. The arith- 
metic mean is the center of gravity of a distribution; it 
would be the x-value of the point of balance of a frequency 

1 Strictly speaking, the mode is the z-value corresponding to the maximum 
ordinate of the ideal frequency curve which has been fitted to the given distri- 
bution. 
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curve, if the curve could be blocked out and manipulated 
in solid form. 

The geometric mean and the harmonic mean are two other 
averages the characteristics of which will be discussed at a 
later point. 

The computation or location of these various averages 
may involve somewhat lengthy processes if the number of 
cases included be great. If appropriate methods be em- 
ployed, however, the labor of computation may be materi- 
ally cut down. The use of the following symbols will sim- 
plify the explanation of these methods: 

M: Arithmetic mean. 

Mo: Mode. 

Md: Median. 

m: The value of an individual observation; in a fre- 
quency distribution, the value of the midpoint of 
a class. 

f: The number of items (observations) in a given class 
in a frequency distribution. 

N: The total number of items in a given series or fre- 
quency distribution. 

2 (Sigma) : The symbol for the process of summation, meaning 
“the sum of. ” 


The Computation of the Arithmetic Mean 

Using the above notation, the formula for the arithmetic 
mean is 

,, _ 2m 
M ~ N" 

Thus the mean of the measures 2, 5, 6, 7, is equal to the 

20 

sum of these measures divided by 4, which is -j- or 5. The 

computation of the arithmetic mean when each measure is 
reported at its true value is thus a simple process of sum- 
mation and division. The weekly earnings of 210 factory 
employees were listed in an earlier section. If these figures 
be added, and the total divided by 210, the mean weekly 
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wage is found to be $26,983. In this case the task of add- 
ing 210 items is somewhat tedious; it is a task which would 
become almost impossible if one were dealing with the 
37 million personal income figures, for example. For prac- 
tical reasons, therefore, it is usually necessary to compute 
the required averages from the frequency distribution rather 
than from the original ungrouped data. To exemplify this 
process we may utilize data relating to the weekly earnings 
of steel workers in the Pittsburgh District in 1935. 

The importance of certain of the precautions mentioned in 
the section on classification, in connection with the choice 
of a class-interval, will be clear from this example. When 
the mean of a distribution is calculated from classified ob- 
servations, we must assume an even distribution of cases 
within each class. The class-interval should be selected 
with this in mind, in order that errors introduced by the 
assumption may be minimized. If the items in each class 
are evenly distributed, the mid-value of each class may be 
taken as representative of all the observations included; 
when such a mid-value is multiplied by the number of items 
in the class, the product is approximately equal to the sum 
of all the individual items in the class. The formula for the 

mean thus becomes M = Table 22 illustrates the 

procedure in detail. 

The value secured in this way is sometimes called a 
weighted arithmetic mean. What we do, in effect, is to 
secure the arithmetic mean of the 28 figures in the column 
headed m. We do not take a simple average of these fig- 
ures, however, but weight each one in proportion to the 
number of cases falling in the class-interval of which it is 
the mid-value. It is precisely the procedure we should fol- 
low in calculating the mean of five men’s incomes, two of 
whom, let us say, have incomes of $2,000 and three of whom 
have incomes of $3,000. Clearly it would not do to add the 
figures $2,000 and $3,000, dividing the sum by two. The 
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Table 22 1 

Calculation of the Arithmetic Mean of Weekly Earnings of Workers 
in Open-Hearth Furnaces in the Pittsburgh District in 1935 


Class-interval 

Mid-point 

Frequency 

fm 

, dollars per week) 

m 

f 

* 0-$ 3.99 

2 

67 

134 

4- 7.99 

6 

290 

1,740 

8- 11.99 

10 

437 

4,370 

12- 15.99 

14 

730 

10,220 

16- 19.99 

18 

1,056 

19,008 

20- 23.99 

22 

1,009 

22,198 

24- 27.99 

26 

712 

18,512 

28- 31.99 

30 

609 

18,270 

32- 35.99 

34 

334 

11,356 

36- 39.99 

38 

187 

7,106 

40- 43.99 

42 

179 

7,518 

44- 47.99 

46 

105 

4,830 

48- 51.99 

50 

60 

3,000 

52- 55.99 

54 

67 

3,618 

56- 59.99 

58 

28 

1,624 

60- 63.99 

62 

37 

2,294 

64- 67.99 

66 

33 

2,178 

68- 71.99 

70 

29 

2,030 

72- 75.99 

74 

16 

1,184 

76- 79 99 

78 

8 

624 

80- 83.99 

82 

3 

246 

84- 87.99 

86 

8 

688 

88- 91.99 

90 

4 

360 

92- 95.99 

94 

7 

658 

96- 99.99 

98 

9 

882 

100-103.99 

102 

5 

510 

104-107.99 

106 

1 

106 

108-111.99 

110 

1 

110 

Total 


6,031 

145,374 

M 

2(fm) $145,374 

N 6,031 

= $24.1045. 



1 These figures and similar data appearing in subsequent tables were com- 
piled by Edward K. Frazier, of the Division of Wages, Hours and Working 
Conditions, U. S. Bureau of Labor Statistics. See “Earnings and Hours in 
Blast Furnaces, Bessemer Converters, Open-Hearth Furnaces and Electric 
Furnaces, 1933 and 1935” Monthly Labor Review , April, 1936. The detailed 
statistics in Table 22 were provided through the courtesy of Dr. Isador Lubin 
Commissioner of Labor Statistics. 
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figure $2,000 is given a weight of two, the figure $3,000 is 
given a weight of three, and the resultant sum, $13,000, is 
divided by five. Though the procedure in working from the 
frequency distribution is thus a form of weighting, the term 
“weighted average” is coming to have a more restricted 
meaning, to be explained at a later point, and should not 
in general be applied to an average computed from a fre- 
quency distribution. 

SHORT METHOD OF COMPUTING THE ARITHMETIC MEAN 

The calculation of the arithmetic mean from the fre- 
quency table is much easier, in general, than from the un- 
grouped data, but when the number of cases included is 
large even the computation from the frequency table by 
the method illustrated above may be laborious. The pro- 
cedure may be greatly simplified. 

From the method of computing the arithmetic mean it 
follows that t he algebraic sum of the devi ations of a series 
of i ndividual m agn itudes from their mean is zero. This 
may be readily demonstrated. We may represent the series 
of magnitudes by Wi, m 2 , m s , . . . m„, their arithmetic 
mean by M, and the deviations of the various magnitudes 
from the mean by di, d 2 , d 3 , . . . d„. 

Then 

wii + m* + m 3 + . . . + m„ 

= M (1) 

and 

m t + ms + m 3 + . . . + m n = JVM. (2) 

The number of terms, of course, is equal to N. Therefore, 
subtracting M N times from each side of the equation, 

(mi — M) + (m* — M) + (m 3 — M) + . . . + (m, — M) = 0. (3) 

But 

mi — M = di, m® — M = d 2 , etc., and equation (3) may be written 

Zd = 0. 

Knowing this to be true we may measure the deviations 
of a series of magnitudes from any arbitrary quantity, 
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secure the algebraic sum of the deviations, and from this 
value ascertain the difference between the arbitrary quan- 
tity and the true mean. For this difference will be the 
mean of the deviations from the arbitrary origin. If we 
let M' represent the arbitrary origin, or assumed mean, 
while c = M — M', and di, di, d 3 ' . . . d n \ represent the 
deviations of the various magnitudes from M' (i.e., di 
= m\ — M', di = m .2 — M', etc.), then 

d\ — di “b c, di = di -f* c, d 3 — d% -|- c, . . . d n ‘ — dn d - c 
and 

S d' = S d + Nc. 

But 

2d = 0 
.-. 2d' = Nc 

and 



From the known values of M' and c the value of the true 
mean may be obtained, for M - M' + c. The procedure is 
illustrated in the following simple example : 


Table 23 


Computation of the Arithmetic Mean ( Short Method) 


(Ungrouped data) 


m 

/ 

d’ 


5 

1 

- 15 

8 

II 

15 

1 

- 5 

2d' 

25 

1 

+ 5 

N 

35 

1 

+ 15 

M = M' 

45 

1 

+ 25 



+ 25 


= 5 


+ 25 


When the deviations are measured from 20 as arbitrary 
origin there is in each case a constant error, if the devia- 
tion from the true mean be taken as standard. This error 
is equal to the difference between the true and the assumed 
means. The algebraic sum of the deviations from the 
assumed mean will equal N times this constant error, since 
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the error is repeated once for every item included. By 
dividing the sum of these deviations by N the amount of 
the error may be determined and the value of the mean 
thus obtained. 


Table 24 

Calculation of the Arithmetic Mean of Weekly Earnings of Workers 
in Open-Hearth Furnaces in the Pittsburgh District in 1935 
(Short method) 


Class- 

interval 

Mid- 

Fre- 

d! 

(in class- 


fd' 

Calculations 

(in dollars 

point quency 

interval 

- 

4- 

M' = $30 

per week) 

m 

f 

units ) 




$ 0-$ 3.99 

2 

67 

~ 7 

469 


1. Algebraic sum of devi- 

4- 7.99 

6 

290 

- 6 

1,740 


ations from M r 

8- 11.99 

10 

437 

- 5 

2,185 


- 13,212 

12- 15.99 

14 

730 

- 4 

2,920 


4- 4,323 

16- 19.99 

18 

1,056 

- 3 

3,168 


— 8,889 

20- 23.99 

22 

1,009 

2 

2,018 



24 — 27.99 

26 

712 

- 1 

712 


2. Calculation of c (in 

28- 31.99 
32- 35.99 

30 

34 

609 

334 

0 

4- 1 


334 

class-interval units) 

36- 39.99 

38 

187 

+ 2 


374 

= = _ x. 47388 

40- 43.99 

42 

179 

4- 3 


537 

6,031 

44- 47.99 

46 

105 

4- 4 


420 


4^- 51.99 

50 

60 

4- 5 


300 

3. Reduction of c to orig- 

52- 55.99 

54 

67 

4- 6 


402 

inal units 

56- 59.99 

58 

28 

4- 7 


196 

Class-interval — $4 

60- 63.99 

62 

37 

4- 8 


296 

c (in original units) 

64- 67.99 

66 

33 

4- 9 


297 

* - 1.47388 X $4 

68- 71.99 

70 

29 

4- 10 


290 

= $- 5.8955 

72- 75.99 

74 

16 

4- 11 


176 


76- 79.99 

78 

8 

4* 12 


96 

4. Determination of M 

80- 83.99 

82 

3 

4- 13 


39 


84- 87.99 

86 

8 

4- 14 


112 

M — M 4~ c 

88- 91.99 

90 

4 

4- 15 


60 

M = $30 - $5.8956 

92- 95.99 

94 

7 

4- 16 


112 

M = $24. 1045/ 

96- 99.99 

98 

9 

4- 17 


153 

i/ 

100-103.99 

102 

5 

4- 18 


90 


104-107.99 

106 

1 

4- 19 


19 


108-111.99 

110 

1 

4- 20 


20 


Total 


6,031 


- 13,212 

4* 4,323 



The work of computation may be still further abbrevi- 
ated, for observations arranged in the form of a frequency 
distribution, by measuring the deviations in terms of the 
class-interval as a unit. Then, in finally applying the neces- 
sary correction, the difference between the true and assumed 
means may be again expressed in terms of the original units. 
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The method may be illustrated in detail with reference to 
the wage data for which the mean has already been calcu- 
lated. 

The steps in this process of calculating the arithmetic 
mean by the short method may be briefly summarized: 

1. Organize the data in the form of a frequency distribution. 

2. Adopt as the assumed mean the midpoint of a class near the 

center of the distribution. 

3. Arrange a column showing the deviation (d') from the assumed 

mean of the items in each class, in terms of class-interval 
units. This deviation will be zero for the items in the class 
containing the assumed mean, — 1 for the items in the next 
lower class, + 1 for the items in the next higher class, and so 
on. 

4. Multiply the deviation of each class by the frequency of that 

class, taking account of signs. These products are entered 
in the column Jd ! . 

5. Get the algebraic sum of the items entered in the column fd'. 

6. Divide this sum by the total frequency (i\T). The quotient is 

the correction (c) in class-interval units. 

7. Multiply the correction (c) by an amount equal to the class- 

interval. The product is the correction in terms of the 
original units. 

8. Add this correction (algebraically) to the assumed mean ( M '); 

the sum is the true mean (ill). 

Location of the Median 

TJNGROUPED DATA 

The median is a value of a variable so selected that 
50 per cent of the total number of cases, when arranged in 
order of magnitude, lie below it and 50 per cent above it. 
For many frequency distributions this is a useful and sig- 
nificant value. 

When handling data which are not arranged in the form 
of a frequency distribution the location of the median is a 
simple matter. The data having been arranged in order of 
magnitude, it is necessary only to count from one end 
until that point on the scale of values is found which divides 
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the number of cases into two equal parts. As a simple 
example we may assume that the following seven figures 
represent the annual incomes of seven individuals: 

$750 $975 $1,128 $1,450 $1,475 $1,825 $1,950 

The scale of values extends from $750 to $1,950, and 
seven items are arranged along this scale. The value of 
$1,000 has two items on one side and five items on the 
other, so obviously does not conform to our definition of 



Fig. 46. — Illustrating the Location of the Median with Ungrouped Data 
(Personal incomes of seven individuals) 


the median. The value of $1,450, which corresponds with 
the income of one of the seven individuals, is the median 
in this case. Three items lie on each side of this value; or, 
if we assume the central item to be cut in two, 3§ items he 
on each side of this point. This case is illustrated in Fig. 46. 
This diagram may help to bring out the fact that the 
median is a point on a scale so located that it cuts the 
frequencies in two. 

The problem is slightly different when an even number of 
cases is included. This condition is exemplified in the table 
on page 111 which shows the average earnings per man-hour 
in each of 38 selected industries during the year 1933. 

In this case the median must be a value on each side of 
which 19 industries lie. Therefore any value exceeding 
$0,425 (average earnings in the prepared feed industry) 
and less than $0,426 (average earnings in the meat packing 
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Table 25 

Average Earnings per Man-IIour in Selected Manufacturing 

Industries 1 


Silk and rayon goods: Commission throwing 

Average wage 
per man-hour 
$.278 

Cotton goods 

.279 

Cigars 

.299 

Silk and rayon goods: Commission weaving 

.313 

Silk and rayon goods: Regular throwing 

.316 

Knit underwear 

.319 

Knit outerwear 

.358 

Cigarettes 

.361 

Silk and rayon goods: Regular weaving 

.369 

Wool shoddy 

.370 

Hosiery 

.372 

Cotton small wares 

.378 

Woolen goods 

.395 

Sugar, beet 

.395 

Worsted goods 

.399 

Snuff, and chewing and smoking tobacco 

.402 

Knit cloth 

.414 

Rayon yarns 

.421 

Feeds, prepared 

.425 

Meat packing 

.426 

Pulp 

.431 

Ice, manufactured 

.436 

Flour milling 

.444 

Paper 

.445 

Carpets and rugs, wool 

.464 

Leather tanning 

.470 

Sugar refining, cane 

.481 

Soap 

.482 

Blast furnaces 

.488 

Felt goods 

.488 

Cereal preparations 

.510 

Steel works 

.519 

Motor vehicle bodies and parts 

.561 

Machine tools 

.585 

Motor vehicles 

.610 

Machine-tool accessories 

.621 

Petroleum refining 

.643 

Malt 

.657 


1 From Monthly Labor Review, October, 1936, 910. 
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industry) will satisfy the definition of a median. Under 
these conditions, where the median is really indeterminate, 
a value half-way between the two limiting values is accepted, 
by convention. The median of the 38 figures would thus 
be $0.4255. 

In this example the median value does not correspond 
with the earnings in any one industry. This will frequently 
be so when there is an even number of observations. 


GROUPED DATA 

The task of locating the median is essentially the same 
when the data are in the form of a frequency distribution. 
The fact that the real values of the individual items are 
not known, because of the grouping by classes, complicates 
the problem slightly. The data in Table 26, relating to 
advertising rates of daily newspapers in the United States, 
may be used in illustrating the method. 


Table 26 


Location of Median , Newspaper Advertising Rates in 1933 
Minimum Line Rates for National Advertising, 245 Daily Newspapers 
in Cities of 25,000 to 50,000 Population 1 


Class-interval 

No. of newspapers 

Rate per line 

charging stated 

(in cents) 

rate 


f 

1.0- 2.99 

6 

3.0- 4.99 

53 

5.0- 6.99 

85 

7.0- 8.99 

56 

9.0-10.99 

21 

11.0-12.99 

16 

13.0-14.99 

4 

15.0-16 99 

4 


245 


N 

2 


245 


= 122.5 
63.5 


211^ = 5.0 + (^X2.o) 


5.0 + 1.49 
6.49 


In the present case the location of the median involves 
the determination of that value on each side of which 122.5 
1 Source: Editor and Publisher, International Yearbook for 1933. 
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items lie. We may assume that we start at the lower end 
of the scale and move through the successive classes. When 
we reach the upper limit of the first class (that including 
items having values from 1 . 0 to 3 . 0) we have left behind 
us 6 cases, while 239 lie in front of us. When the upper 
limit of the second class is attained, 59 items have been 
passed. The upper limit of the third class has below it 
144 items. Somewhere between the lower and upper limits 
of the third class lies the desired point, that which has 
122.5 items on each side of it. How far must we move 
through this class, from 5.0 to 7.0 in order to reach this 
point? 

It will be recalled that, for purposes of calculation, the 
assumption is made that there is a uniform distribution of 
the items lying within any given class. Since before we 
reach the third class 59 cases have been counted, only 63 . 5 
of the 85 included in this class are needed to complete the 
desired number, 122.5. On the assumption of even distri- 
bution the required 63.5 cases will lie within a distance 
63 5 

on the scale equal to of the class-interval. The class- 

85 

63 5 

interval is 2.0; of 2.0 is equal to 1.49. As we move 
85 

up the scale, then, having reached 5.0, we proceed an addi- 
tional distance equal to 1 .49. At a point on the scale having 
a value of 6.49 is the dividing line on each side of which 
lie 122.5 cases. This is the value of the median. 

The process of computation is shown at the right of the 
frequency table. The following is a summary of the steps 
involved in the location of the median: 

1. Arrange the data in the form of a frequency distribution. 

2. Divide the total number of measures by 2; this gives the 

number which must lie on each side of the point to be located. 

3. Begin at the lower end of the scale and add together the fre- 

quencies in the successive classes until the lower limit of the 
class containing the median value is reached. 
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4. Determine the number of measures from this class which must 

be added to the frequencies already totaled to give a number 
equal to N/2. 

5. Divide the additional number thus required by the total 

number of cases in the class containing the median. This 
indicates the fractional part of the class-interval within which 
the required cases lie. 

6. Multiply the class-interval by the fraction thus set up. 

7. To the lower limit of the interval containing the median add 

the result of the multiplication process indicated in (6). 
This gives the value of the median. 

The last three steps constitute merely a simple form of 
interpolation. 

The entire process may be reversed by beginning at the 
upper end of the scale and counting downwards. In this 
case the final operation is one of subtraction from the upper 
limit of the interval containing the median. 

N/2 may be a fractional value, as in the example given, 
or a whole number. The operation is precisely the same in 
the two cases. 


Quartiles and Decides 

For many purposes it is desirable to locate on the scale 
of values, along which the items constituting a frequency 
distribution are ranged, points dividing the total number 
of measures in other ways. Similar to the median, which 
divides the total number of cases into two equal groups, 
are the quartiles, deciles, and percentiles. The quartiles, 
as the term implies, are points on the scale which divide 
the entire number of measures into four equal groups, the 
deciles divide the number into ten equal groups, and the 
percentiles divide the total number of cases into 100 equal 
groups. Thus the first quartile is that point on the scale 
below which one quarter of the total number of cases lie 
and above which three quarters of the total number of 
cases lie. The second quartile and the median are identical 
values. The third decile is that point on the scale below 
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which three tenths of the total number of cases lie and 
above which seven tenths of the total number of cases lie. 
In all cases the count begins at the lower end of the scale. 

Example: Location of the First Quartile (Q i), Newspaper Advertising Rates 

(See Table 26) 

JV/4 = 61.25 

Qi = 5.0 + (2.25/85 X 2.0) 

= 5.05 

Example : Location of Eighth Decile (D 8 ), Newspaper Advertising Rate* 

(See Table 26) 

Nf 10 = 24.5 D b = 7.0 -f (52/56 X 2.0) 

8iVyiO = 196 =8.86 

A method of locating median, quartiles, deciles and per-* 
centiles graphically is explained below. 

Location of the Mode 

The mode is the value of the x-variable corresponding to 
the maximum ordinate of a given frequency curve. The 
concept of a modal value is a thoroughly easy one to grasp. 
It is the most common wage, the most common income, 
the most common height. It is the point where the con- 
centration is greatest, a characteristic which is effectively 
brought out by Fechner’s term for this average, dichtester 
wert, or thickest value. It is not so easy, however, to locate 
the true modal value in a given case. In general statistical 
work an approximate value only is secured for the mode, 
but for most practical purposes this value is usually suf- 
ficiently accurate. 1 

The method of determining this approximate modal value 
may be illustrated by reference to the distribution shown 
in Table 27 on page 116. 

There is wide dispersion of the 22 cases falling below 40, 
and the existence of this “open-end” class makes it impos- 
sible to compute the mean, as the table stands. The mode 

1 A method of locating the mode more accurately is explained in a later 
seotion. 
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Table 27 

Frequency Distribution of Five Per Cent Bonds 

(This table is based upon quotations on the New York Stock Exchange on 
June 13, 1936, on railroad and industrial bonds with coupon rate of 5 per cent) 


Quoted price 

Mid-point 

Frequency 

Class-interval 

m 

f 

Less than 40 


22 

40- 49.9 

45 

5 

50- 59.9 

55 

5 

60- 69.9 

65 

3 

70- 79.9 

75 

8 

80- 89.9 

85 

9 

90- 99.9 

95 

19 

100-109.9 

105 

49 

110-119.9 

115 

10 

120-129.9 

125 

3 

130-139.9 

135 

1 



134 


is therefore an appropriate average to employ in the present 
instance. 

The class having limits of 100-109 . 9 contains the greatest 
number of cases. This appears to be the modal group, and 
the mid-point of this class, 105, may be tentatively accepted 
as the value of the approximate mode. But with different 
classifications quite different values might be secured for 
the mode. When the original bond quotations are tabulated 
with varying class-intervals the following results are secured. 
(Only the frequencies of the central classes are shown. It 
is not necessary, for this purpose, to present each of the 
tables as a whole.) 


(a) 

Class-interval — 5 
Class-interval f 
80 - 84.9 3 

86 - 89.9 6 

90 - 94.9 10 

96 - 99.9 9 

100 - 104.9 29 

105 - 109.9 20 

110 - 114.9 7 

115 - 119.9 3 


(b) 

Class-interval = 2.6 
Class-interval f 

90 . 0 - 92.49 4 

92 . 6 - 94.99 6 

95 . 0 - 97.49 2 

97 . 5 - 99.99 7 

100 . 0 - 102.49 9 

102 . 6 - 104.99 20 

105 . 0 - 107.49 13 

107 . 5 - 109.99 7 


(c) 

Class-interval = 2.5 
Class-interval f 
98 . 76 - 101.249 6 

101 . 26 - 103.749 17 
103 . 75 - 106.249 20 
106 . 25 - 108.749 8 


(d) 

Class-interval = 1 
Class-interval / 

100 - 100.9 1 

101 - 101.9 2 

102 - 102.9 9 

103 - 103.9 10 

104 - 104.9 7 

105 - 105.9 6 

106 - 106.9 5 

107 - 107.9 4 
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With a class-interval of 5 a value of 102 . 5 is secured for 
the mode; with a class-interval of 2.5 a value of 103.75 is 
obtained. A class-interval of 2.5, again, but with different 
class limits, yields a mode of 105. Finally, a class-interval 
of 1 gives a mode of 103 . 5. Further changes in classification 
would give still other values. The mode thus appears to be 
a curiously intangible and shifting average. Its value, for 
the same data, seems to vary with changes in the size of the 
class-interval and in the location of the class-limits. 

These difficulties arise primarily from limitations to the 
size of the sample being studied. The true mode, that 
value which would occur the greatest number of times in 
an infinitely large sample, could be located exactly if we 
could increase indefinitely the number of cases included. 
For, given sufficient cases, the approximate mode approaches 
the true mode as the class-interval decreases. Grouping in 
large classes obscures details, and as these classes are re- 
duced in size more of the details are seen and a truer picture 
of the actual distribution is secured. But since most prac- 
tical work is necessarily based upon relatively small samples, 
the increase in the number of classes reveals gaps and 
irregularities, and causes such a loss of symmetry and order 
that doubt arises as to where the point of greatest concen- 
tration really lies. The different tabulations of bond prices 
furnish an excellent example of this. 

By mathematical methods it is possible to obtain a value 
for the true mode without securing an infinite number of- 
cases. The smoothing process has been briefly explained. 
One sort of smoothing involves the fitting of an appropri- 
ate type of ideal frequency curve to the data of a given 
frequency distribution. This gives, theoretically, the dis- 
tribution which would be secured by the process first indi- 
cated, that of decreasing indefinitely the size of the class- 
interval and increasing indefinitely the number of cases. 
The value of the x-variable corresponding to the maximum 
ordinate of this ideal fitted curve is the true mode. 
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For most practical purposes approximate values of the mode 
are adequate, and these may be secured by much simpler 
methods. A first and rough approximation may be obtained 
by taking the mid- value of the class of greatest frequency, a 
method suggested above. If the general rules for classifica- 
tion which were outlined in an earlier section have been fol- 
lowed, this procedure will not generally involve a gross error. 

It is possible, given a fairly regular distribution, to secure, 
by a process of interpolation within the modal group, a 
closer approximation than is obtained by accepting the mid- 
value of this group as the mode. Referring again to the 
tabulation of bond prices in Table 27 it will be noted that 
the distribution on the two sides of the modal class is not 
symmetrical. The modal class is that with a mid-value of 
105. The class next below, with a mid-value of 95, contains 
19 cases, while that next above, with a mid-value of 115, 
contains but 10 cases. The disproportion is continued in 
the succeeding classes below and above, more cases being 
bulked below the modal class than above. For other pur- 
poses we have assumed an even distribution of cases between 
the upper and lower limits of each class, but it is probable 
that this is not true of the modal class in the present case. 
Judging from the distribution outside this class, it is likely 
that the concentration is greater in the lower half of the 
class-interval, that is, between 100 and 105. The mode, 
therefore, probably lies below the mid-value 105, rather 
than precisely at that point. We may attempt to locate it 
within the group by weighting, assuming a pull toward the 
lower end of the scale equal to 19 (the number in the class 
next below) and a pull toward the upper end of the scale 
equal to 10 (the number in the class next above). This may 
be expressed by a formula, employing the following symbols: 

l = lower limit of modal class. 
fi = frequency of class next below modal class in value. 
ft = frequency of class next above modal class in value. 
i — class-interval. 
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The interpolation formula is 

Mo = l + X *• 

ft +/i 

Applying this formula to the bond price data presented in 
Table 27, we have 

Mo = 100 + Qjj X lo) = 100 + 3.45 = 103.45. 

A closer approximation may sometimes be secured by bas- 
ing the weights (represented by / 2 and fi) upon the total 
frequencies of the two or three classes next above the 
modal class and the same number below. If three classes 
on each side are included in the present case, a value of 
102 . 8 is secured for the mode of bond prices. 

In some cases the problem of locating the mode is com- 
plicated by the existence of several points of concentration, 
rather than the single point which has been assumed in 
the preceding explanation. Thus in Table 9, representing 
the distribution of wages, with a class-interval of 25 cents, 
there are two definite modal points. A distribution of this 
type is called bi-modal; when plotted, a frequency curve 
having two humps is obtained. If the data are homogene- 
ous such a distribution is the result of paucity of data and 
of the method of classification employed. It may be due 
to the use of a class-interval too small, with respect to the 
number of cases included in the sample. An approximate 
mode may be determined in such cases by shifting the 
class-limits and increasing the class-interval, carrying on 
this process until one modal group is definitely established. 
This reverses the process by which the true mode may be 
located when the number of cases is infinitely large. Under 
such conditions the class-interval might be reduced until 
it was infinitely small. But with a limited number of cases 
the location of the point where the concentration is greatest 
necessitates increasing the size of the class-interval, in order 
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to get away from the irregularities due to the smallness of 
the sample. 

If the distribution remains bi-modal in spite of changes 
in the class-intervals and class-limits, it is probable that 
the data are not homogeneous, that two different distri- 
butions have by mistake been combined. Such cases are 
not uncommon in biometrical work. The existence of two 
distinct animal species where only one was suspected has 
been revealed in this way. The whole significance of a 
frequency distribution will be lost if the data are not homo- 
geneous, a fact which is as true of work in the field of eco- 
nomic statistics as in any other. 

DETERMINATION OF THE MODAL VALUE FROM MEAN 
AND MEDIAN 

Another method of securing an approximate value for 
the mode, a method based upon the relationship between 
the values of the mean, median and mode, may be em- 
ployed in certain cases. In a perfectly symmetrical distri- 
bution mean, median and mode coincide. As the distribu- 
tion departs from symmetry these three points on the scale 
are pulled apart. If the degree of asymmetry is only mod- 
erate the three points have a fairly constant relation. The 
mode and mean lie farthest apart, with the median one 
third of the distance from the mean towards the mode. If 
the asymmetry is marked, no such relationship may pre- 
vail. Having the values of any two of the averages in a 
moderately asymmetrical frequency distribution, therefore, 
the other may be approximated. In fact, however, the 
method should only be employed in determining the value 
of the mode, as the other two values may be computed 
more accurately by other methods. The value of the mode 
itself should only be determined in this way when more 
exact methods are not applicable or are not called for. 

The following formula is based upon this relationship: 

Mo = Mean — 3(Mean — Md). 
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Applying this formula to the telephone pole data shown 
in Table 12, the following result is secured: 

Mo = 9.33 - 3(9.33 - 9.015) = 8.385. 

This value is slightly below the mid-value of the modal 
class, 8 . 5, and is also less than the value 8 . 49 which is se- 
cured by weighting within the modal group (using four classes 
on each side). 

It must be emphasized that there is a fictitious accuracy 
to all these values for the mode. All the methods of locat- 
ing the mode which have been discussed are merely approx- 
imative, a fact not to be forgotten in interpreting and uti- 
lizing the results. 

Graphic Location of Mode, Median, Quartiles, and 

Deciles 

A better understanding of the frequency curve and of 
the cumulative frequency curve may be secured through a 
brief discussion of certain methods of locating graphically 
some of the statistical measures that have been described. 

The value of the mode may be readily determined from 
a frequency curve of the usual type, for, by definition, the 
mode is the reading on the horizontal scale corresponding 
to the maximum ordinate of such a curve. If this reading 
be taken from the frequency polygon a rough value will be 
obtained, the mid-value of the class of greatest frequency. 
A closer approximation to the true value of the mode will 
be secured from a curve which has been smoothed, either 
by inspection or by mathematical methods. Figure 47, 
showing a curve (smoothed by inspection) based upon the 
wage data presented in Table 8, indicates how the mode 
may be located graphically. The horizontal reading corre- 
sponding to the maximum ordinate of this curve is $27 . 50, 
an approximate value of the mode which may be compared 
with the values of $27 . 69 secured by the weighting process 
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and of $27.3470 secured from the values of the mean and 
median. 


The locations of the median and mean have been indi- 
cated on this chart. It has been pointed out that in mod- 
erately asymmetrical (or skewed) distributions there tends 
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Fig. 47. — Distribution of Weekly Earnings of Employees. A Smoothed 
Frequency Curve, showing the Relation between Mean, Median and Mode 

to be a constant relationship between the three averages 
which have been described, the median lying between the 
mean and the mode, and approximately one third of the 
distance from the former towards the latter. In the present 
case this relationship holds fairly well when the value of 
the mode is approximated from the smoothed curve. The 
irregularities in the original data render the process of 
smoothing by inspection rather arbitrary, however. 

In Fig. 48 the same data are represented by a cumulative 
frequency curve, based upon Table 28 on page 124. The steep- 
ness of a cumulative frequency curve within any given inter- 
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val depends upon the number of cases added within the cor- 
responding interval on the horizontal scale. Thus the curve 
rises gradually at first, then more steeply, and tails off 
gradually at the upper extremity. The value of the mode, 
obviously, is the reading on the horizontal scale correspond- 
ing to the point of greatest steepness. This is the point at 



Fig. 48. — Cumulative Distribution of Weekly Earnings of Employees, 
Illustrating the Graphic Location of Median and Quartiles 

which the increase of frequencies is greatest, the point of 
greatest concentration in the frequency distribution. The 
value of the mode may be approximated from a smoothed 
frequency curve by locating the point at which the slope is 
greatest (which is a point of inflection) and taking the corre- 
sponding reading on the x-scale. In the present case a value of 
approximately $27 . 50 is secured for the mode by this method. 

Values for the median, quartiles, and deciles may also be 
secured graphically from the cumulative frequency curve. 
The smoothing of such a curve provides a quite satisfactory 
method of interpolation and, if the scale of the diagram 
is sufficiently large, accurate values may be obtained by 
this method. Locate on the vertical scale (the scale of 
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Table 28 

Cumulative Distribution of Wage-Earners in a Manufacturing 

Establishment 

(Classified on the basis of weekly earnings) 


Weekly earnings 

Number earning 
(Jrequer 

Less than $22 . 50 

0 

u 

tt 

23.00 

1 

a 

tt 

23.50 

5 

ti 

tt 

24 00 

8 

it 

tt 

24.50 

19 

it 

tt 

25.00 

29 

it 

tt 

25.50 

41 

it 

tt 

26.00 

56 

it 

tt 

26.50 

78 

tt 

tt 

27.00 

98 

it 

tt 

27.50 

122 

tt 

tt 

28.00 

152 

it 

tt 

28.50 

169 

tt 

tt 

29 00 

186 

tt 

tt 

29 50 

193 

tt 

tt 

30 00 

199 

tt 

tt 

30.50 

204 

tt 

tt 

31.00 

208 

tt 

tt 

31.50 

209 

it 

tt 

32.00 

209 

tt 

ft 

32.50 

210 


N 

cumulative frequencies) a point distant from the base by — 

A 


If from this point a horizontal line be extended to the cumula- 
tive curve, the abscissa of the point of intersection will be 
the value of the median. This value may be easily deter- 
mined by dropping a vertical line from the point of inter- 
section to the x-axis. Figure 48 illustrates the application 
of this method. A value of $27 . 125 is secured for the median 
by this method. By direct interpolation a value of $27 . 1458 
is obtained. The quartiles may be located in precisely the 
same way, the vertical scale being divided into quarters 
and horizontal lines extended to the cumulative curve from 
the points thus located on the vertical scale. 
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For some purposes, particularly those that involve the 
averaging of rates or ratios rather than quantities, none of 
the averages which have been described is suitable. The 
geometric and the harmonic means are types of averages 
that should be familiar because they are particularly ap- 
propriate for such purposes. 


The Geometric Mean 

The geometric mean is the nth root of the product of n 
measures; its value thus is represented by: 

j Ol- C&2 * ... fln* 

The geometric mean of the numbers 2, 4, 8, is 



It is obvious from the method of computation that if 
any one of the measures in the series has a value of zero the 
geometric mean is zero. 

The actual computation of the geometric mean is greatly 
facilitated by the use of logarithms. In this form 


Log Mg = 


log ai + log ffla + log «3 + . . • + log a„ 
N 


The logarithm of the geometric mean is equal to the arith- 
metic mean of the logarithms of the individual measures. 

When the measures, of which the geometric mean is de- 
sired, are to be weighted, the separate weights are intro- 
duced as exponents of the terms to which they apply. Thus 
if we represent the sum of the weights by N and the weights 
corresponding to the terms Oi, o 2 , a 3 ... a„, respectively, 
by Wi, w it w 3 ... w n , the formula for the geometric mean is 


'a i tt > • ai 0 ' • 


This is equivalent to repeating each term a number of 
times, the number corresponding to the amount by which 
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it is weighted. (This, of course, is precisely what is done 
in securing a weighted arithmetic mean.) When logarithms 
are employed the formula for the weighted geometric mean 
becomes 

, Wi log ai + «>j log a« + tfs log as + . . . + u>„ log a„ 

Log "*» — A 

A method of computing the geometric mean may be 
illustrated with reference to Table 29, which shows the 
distribution of the prices of 66 preferred stocks paying 
seven per cent dividends. The table is based upon closing 
prices on the New York Stock Exchange and the New York 
Curb Exchange for the week ended July 25, 1936. 

Table 29 


Computation of the Geometric Mean of Preferred, Stock Prices 


Class-interval 

m 

/ 

log 771 

/log m 

$ 70-$ 89 9 

80 

5 

1.90309 

9.51545 

90- 109.9 

100 

20 

2 00000 

40.00000 

110- 129 9 

120 

27 

2.07918 

56. 13786 

130- 149 9 

140 

6 

2.14613 

12.87678 

150- 169.9 

160 

8 

2.20412 

17.63296 

Log Mg = 

136.16305 

66 

66 

136.16305 

Log M„ = 2.06308 

M, = 115.63 


CHARACTERISTICS OF THE GEOMETRIC MEAN 

The nature of the geometric mean may be understood 
by considering its relation to the terms it represents, as an 
average. 

If the arithmetic mean of a series of measures replace 
each item in the series, the sum of the measures will remain 
unchanged. Thus, the sum of the numbers 2, 4, 8 is 14. 
The arithmetic mean of these three numbers is 4$ ; if this 
value be inserted in the place of each of the three measures 
the sum remains 14. It is characteristic of the geometrio 
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mean that the product of a series of measures will remain 
unchanged if the geometric mean of those measures replace 
each item in the series. Thus the product of 2, 4, 8 is 64. 
The geometric mean of the three numbers is 4; if this value 
replace each of the three measures the product remains 64. 

Again, it is true of the arithmetic mean that the sum of 
the deviations of the items above the mean equals the 
sum of the deviations of the items below the mean (disre- 
garding signs). The sums of the differences between the 
individual items and the mean arc equal. In the case of 
the geometric mean the products of the corresponding ratios 
are equal. If the ratios of the geometric mean to the meas- 
ures which it exceeds be multiplied together, the product 
will equal that secured by multiplying together the ratios 
to the geometric mean of the measures exceeding it in value. 
For example, the geometric mean of the numbers 3, 6, 8, 9 
is 6. The following equation may be set up: 



= § x ? 
6 X 6 


The last example brings out the most important charac- 
teristic of the geometric mean. It is a means of averaging 
ratios. Its chief use in the field of economic statistics has 
been in connection with index numbers of prices, where 
rates of change are of major importance. A rise in prices 
represented by the change from 50 to 100 is as important 
as a rise from 100 to 200. Yet this equivalence is not brought 
out by the arithmetic mean, which gives double weight 
to the change which involves an absolute difference of 100. 
An example frequently cited is that of two cases of price 
change, one a ten-fold increase, from 100 to 1,000, the other 
a fall to one tenth of the old price, from 100 to 10. The 
arithmetic mean of 1,000 and 10 is 505, the geometric 
mean is V 1,000 X 10, or 100. When the average is of the 
latter type it is seen that the two equal ratios of change 
have balanced each other. The arithmetic mean, 505, is 
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quite incorrect as a measure of average ratio of price change. 
This subject is discussed at greater length in the chapter on 
index numbers. 

What has been said in an earlier section in regard to the 
advantages of logarithmic charting for certain purposes 
bears upon the use of the geometric mean. This average 
is sometimes called the logarithmic mean, as its logarithm 
is simply the arithmetic mean of the logarithms of the 
constituent measures. Wherever percentages of change are 
being averaged, where ratios rather than absolute differ- 
ences are significant, the use of the geometric mean is 
advisable. 

A problem involving the use of the geometric mean arises 
in computing the average rate of increase of any sum at 
compound interest. If p 0 represent the principal at the 
beginning of the period, p n the principal at the end of the 
period, r the rate of interest and n the number of years 
in the period, the sum to which p 0 will amount at the end 
of the » years, if interest is compounded annually, is repre- 
sented by the equation: 

Pn = Po(l + r) n . 

It follows from this that : 



Thus, if $1,000 at compound interest amounts to $1,600 
at the end of 12 years, there has been an increase of 60 per 
cent. The arithmetic mean is 5 per cent, but this is not the 
rate at which the money increased. The true rate is: 



- \0 - 1 
= 1.04-1 
= .04, or 4% 
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Precisely the same problem arises whenever rates of in- 
crease or decrease are to be averaged. The use of the arith- 
metic mean gives an incorrect result. 

THE GEOMETRIC MEAN AS A MEASURE OP CENTRAL 
TENDENCY 

A question arises as to the type of frequency distribution 
the central tendency of which would be best represented 
by the geometric mean. When the absolute measures, 
plotted on the arithmetic scale, give a fairly symmetrical 
distribution, the arithmetic mean is clearly preferable to the 
geometric mean. But when the absolute figures thus plotted 
give an asymmetrical frequency curve of such a type that 
the asymmetry would be removed and a symmetrical curve 
secured by plotting the logarithms of the measures, the 
geometric mean would appear to be preferable. Such a 
distribution would be one in which not the absolute devia- 
tions about the central tendency but the relative deviations, 
the deviations as ratios, were symmetrical. The arithmetic 
mean of the logarithms of the various measures (which 
value is, as has been shown, the logarithm of the geometric 
mean of the original measures) would be the best representa- 
tive of the central tendency in such a distribution. The 
curve thus plotted would be symmetrical about the logarithm 
of the geometric mean. A frequency curve representing 
the logarithms of percentage changes in prices would tend 
to show this symmetry about the logarithm of the geometric 
mean of these changes. These percentage changes, as nat- 
ural numbers, group themselves in an asymmetrical form, 
with the range of deviations above the arithmetic mean 
greatly exceeding the range below. 1 This arises, of course, 
from the fact that prices of given commodities may increase 
1,000 per cent or more from a given base, but cannot fall 
more than 100 per cent from any given base. The section 


1 Cf. Fig. 51. 
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on index numbers contains a fuller discussion of this partic- 
ular phase of the subject. 1 

The construction of a frequency distribution in which loga- 
rithms are tabulated would be laborious, if the logarithm 
of each item to be entered had to be determined, before 
tabulation. It is possible, however, with no great trouble to 
construct a true logarithmic distribution, with class-interval 
constant in terms of logarithms. The 66 quotations on pre- 
ferred stocks, tabulated in Table 29, range from 74 to 166. 
The logarithm of 74 is 1.86923; the logarithm of 166 is 
2.22011. The range, in logarithms, is .35088. We may 
select .06 as a suitable logarithmic class-interval, for the 
present purpose. For convenience in tabulating the data 
we set up two series of class limits, one in terms of logarithms, 
one in terms of the corresponding natural numbers. In 
constructing the distribution natural numbers may be tab- 
ulated, utilizing the class limits defined in natural terms. 
All subsequent calculations may be carried through in terms 
of logarithms. The distribution appears in Table 30 on 
page 131. 

If the geometric mean is considered appropriate for a given 
series, the type of distribution represented by Table 30 is 
more logical than that shown in Table 29, and the descrip- 
tive measurements secured from Table 30 have correspond- 
ingly greater validity. We may derive the mean of the 

1 C. M. Walsh, in The Problem of Estimation (London, P. S. King Son, 
1921) 35, lays down the following criteria for the use of averages: 

(а) When there are no conceivable or assignable upper or lower limits to the 

values of the terms in a series, the arithmetic average should be em- 
ployed. 

(б) When there is a definite lower limit at or above zero and no upper conceiv- 

able or assignable limit, the geometric average should be employed. 
Because this is true of price changes Walsh believes the geometric 
average to be the correct one to use in making index numbers of prices, 
(c) When in practice, or in the nature of things, certain upper and lower limits 
are found to exist and the above criteria cannot be employed, a study of 
the actual dispersion of the data is necessary. In this case, if the mode is 
found nearer to the arithmetic average, that average should be em- 
ployed; if the mode is found nearer to the geometric average, that aver- 
age should be used. 
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Table 30 

Distribution of Prices of Preferred Stocks 
Paying Seven Per Cent Dividends 


Class-interval 

Class-interval 

Mid-point 

Frequency 


{natural numbers) 

(logarithms) 

(logarithms) 




m 

f 

fm 

$ 70 . 80 -$ 81.27 

1 . 85 - 1.9099 

1.88 

2 

3.76 

81 . 28 - 93.32 

1 . 91 - 1.9699 

1.94 

4 

7.76 

93 . 33 - 107.15 

1 . 97 - 2.0299 

2 00 

12 

24.00 

107 . 16 - 123 02 

2 . 03 - 2.0899 

2.06 

30 

61.80 

123 . 03 - 141 24 

2 09 - 2.1499 

2.12 

6 

12.72 

141 . 25 - 162.17 

2 . 15-2 2099 

2.18 

7 

15 26 

162 . 18 - 186.20 

2 . 21 - 2.2699 

2.24 

5 

11.20 




66 

136.50 


logarithms of the preferred stock prices by dividing 2 fm 
of Table 30 (136.50) by 66. The value is 2.06818. The 
anti-log of this is 116.97, which is the geometric mean of 
the distribution. This differs somewhat from the value 
$115.63 secured from Table 29. The difference is due, in 
part, to the use of different class-intervals and class limits 
in the two cases. With a relatively small number of observa- 
tions such differences would be expected to lead to different 
results. Differing assumptions concerning the internal dis- 
tribution of items within the several classes would also 
contribute to a discrepancy between the two results. The 
value obtained from Table 30 is probably a closer approxima- 
tion to the actual geometric mean than that obtained from 
Table 29. 

A frequency curve based upon the logarithms of the 
measures included rather than upon the natural numbers, 
has been employed to advantage in plotting data relating 
to income distribution. When natural numbers are plotted, 
the range of income distribution is so large that it is physi- 
cally impossible to prepare a chart that will reveal the char- 
acteristic features of all sections of the curve. The process 
of plotting on double logarithmic paper (which is, of course, 
equivalent to plotting the logarithms of both a:’s and y’s) 
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meets this difficulty, giving a true impression of the whole 
distribution and the relations between its parts, and, at the 
same time, brings out certain important features that are 
obscured in the natural scale chart. In particular, this 
device appears to smooth into a straight line that part of 
the curve lying above the mode, a fact which led Vilfredo 
Pareto to enunciate what has been known as Pareto’s Law 
concerning income distribution. An intensive study of the 
distribution of income in the United States has led the staff 
of the National Bureau of Economic Research to call into 
question certain conclusions drawn from Pareto’s generaliza- 
tions, though the value of the double logarithmic scale for 
the presentation of income data has been recognized. 

The Harmonic Mean 

The harmonic mean is a type of average capable of 
application only within a restricted field, but which should 
be employed to avoid error in handling certain types of 
data. It must be used in the averaging of time rates and 
it has distinctive advantages in the manipulation of some 
types of price data. The following example will illustrate 
the method of employing this average. 

A given commodity is priced, in three different stores, at 
“four for a dollar,” “five for a dollar” and “twenty for a 
dollar.” The average price per unit is required. The arith- 
metic average of the figures given (4, 5, and 20) is 9|. If 
we take this to be the average number sold per dollar, the 
average price would appear to be $1.00 -s- 9§, or 1048- cents 
each. But the original quotations are equivalent to unit 
prices of 25 cents, 20 cents, and 5 cents; the arithmetic 
average of these prices is 168 cents apiece. The discrepancy 
between 104$ cents and 16$ cents is due to a faulty use 
of the arithmetic mean in averaging quotations in the “so 
many per dollar” form. Such a mean is, in effect, a weighted 
average, with greater weight being given to quotations 
involving a larger number of commodity units. 
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The correct result may be secured by taking the harmonic 
mean of the three original quotations. The harmonic mean 
of a series of numbers is the reciprocal of the arithmetic mean 
of the reciprocals of the individual numbers. Thus if we repre- 
sent the numbers to be averaged by r u r 2 . . . r„, the formula 
for the harmonic mean, H, is 


1_ 

H 


- + - + - + . . • + - 
n ft n r« 

N 


Using the figures just quoted: 

1 4 5 20 

H 3 

= 10 = 1 
60 6 
H = 6. 


The harmonic mean of 4, 5, and 20 is 6, the average number 
of units sold per dollar. The average price per unit is 
16f cents. 

The computation of the harmonic mean of a series of 
magnitudes is greatly facilitated by the use of prepared 
tables of reciprocals. 1 


Relations between Different Averages 

When different averages are located or computed for a 
given series of magnitudes, certain relationships between 
them are found to prevail. 

1. The arithmetic mean, the median and the mode coincide in 

a symmetrical distribution. 

2. In a moderately asymmetrical distribution the median lies 

between the mean and the mode, approximately one third 
of the distance along the scale from the former towards the 

1 Barlow’s Tables of Squares, Cubes, Square Roots, Cube Roots and Reciprocals, 
New York, Spar and Chamberlain. 
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latter. Hence, for this type of distribution there is an ap- 
proximation to the following relationship: 

Mo « M - 3 (M - Md). 

3. The arithmetic mean of any series of magnitudes is greater 

than their geometric mean. 

4. The geometric mean of any series of magnitudes is greater 

than their harmonic mean. The only exception to the last 
two rules is found when all the measures in the series are 
equal, in which case arithmetic mean, geometric mean and 
harmonic mean are equal. 

5. The geometric mean of any two terms is equal to the geometric 

mean of the harmonic and arithmetic means of those terms. 
Thus if the terms be 2 and 8, the harmonic mean is 3£, the 
geometric mean 4, and the arithmetic mean 5. But 4 is also 
the geometric mean of 3£ and 5. This relationship does not 
hold when the series includes more than two terms, unless 
the terms constitute a geometric series. 

6. When the dispersion of data follows the arithmetic law, the 

mode and median will generally be found closer to the 
arithmetic than to the geometric average. When the dis- 
persion follows the geometric law the mode and median will 
generally be found closer to the geometric than to the arithme- 
tic average. 

Characteristic Features of the Chief Averages 
The arithmetic mean 

1. The value of the arithmetic mean is affected by every measure 

in the series. For certain purposes it is too much affected by 
extreme deviations from the average. 

2. The arithmetic mean is easily calculated, and is determinate 

in every case. 

3. The arithmetic mean is a computed average, and hence is 

capable of algebraic manipulation. 

The median 

1. The value of the median is not affected by the magnitude of 

extreme deviations from the average. 

2. The median may be located when the items in a series are not 

capable of quantitative measurement. 

3. The median may be located when the data are incomplete, 

provided that the number and general location of all the 
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cases be known, and that accurate information be available 
concerning the measures near the center of the distribution. 
4. The median is not as well adapted to algebraic manipulation 
as the arithmetic, geometric and harmonic means. 

The mode 

1. The value of the mode is not affected by the magnitude of 

extreme deviations from the average. 

2. The approximate mode is easy to locate but the determination 

of the true mode requires extended calculation. 

3. The mode has no significance unless the distribution includes a 

large number of measures and possesses a distinct central 
tendency. 

4. The mode is the average most typical of the distribution, 

being located at the point of greatest concentration. 

5. The mode is not capable of algebraic manipulation. 

The geometric mean 

1. The geometric mean gives less weight to extreme deviations 

than does the arithmetic mean. 

2. It is strictly determinate in averaging positive values. 

3. The geometric mean is the form of average to be used when rates 

of change or ratios between measures are to be averaged, 
as equal weight is given to equal ratios of change. It is par- 
ticularly well adapted to the averaging of ratios of price 
change. 

4. The geometric mean is capable of algebraic manipulation. 

The harmonic mean 

1. The harmonic mean is adapted to the averaging of time rates 

and certain similar terms. It has been employed in the 
field of economic statistics in the manipulation of price data. 

2. The labor of computing the harmonic mean and its unfamiliarity 

detract from its usefulness in ordinary statistical analysis. 

3. The harmonic mean is capable of algebraic manipulation. 

This summary has been designed to show that each 
type of average has its own particular field of usefulness. 
Each one is best for certain purposes and under certain 
conditions. The characteristics and limitations of each one 
should be understood in order that it may be appropriately 
employed. A complete description of a frequency distribu- 
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tion frequently calls for the determination of two or three 
of the chief averages, as well as other statistical measure- 
ments. The arithmetic mean is perhaps the most useful 
single average. The simplicity of its computation, the 
possibility of employing it in algebraic calculations and 
the fact that its meaning is perfectly definite and familiar 
make it highly serviceable in statistical work. Its sphere 
of usefulness is not universal, however, and it should only 
be employed when the given conditions render it suitable. 
A fuller appreciation of the distinctive virtues of the geo- 
metric mean is leading to a wider employment of that 
measure in many types of statistical work. A discriminat- 
ing use of averages is essential to sound statistical analysis. 
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CHAPTER V 


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION: MEASURES OF 
VARIATION AND SKEWNESS 

In the preceding chapters we have been concerned, first, 
with methods of reducing a mass of quantitative data to a 
form in which the characteristics of the mass as a whole 
may be readily determined and, in the second place, with 
methods of describing the assembled data. The first ob- 
ject is accomplished with the formation of a frequency 
distribution. The second is partially accomplished when 
there has been obtained a single significant value in the 
form of an average which represents the central tendency 
cf the distribution. But any average, by itself, fails to give 
a complete description of a frequency distribution. Three 
other values are needed before the chief characteristics of a 
given distribution have been measured, and comparison 
with other distributions is possible. The first of these is a 
measure of the degree to which the items included in the 
original distribution depart or vary from the central value, 
the degree of “scatter,” variation or dispersion. The second 
is a measure of the degree of symmetry of the distribution, 
of the balance or lack of balance on the two sides of the 
central value. The third is a measure of kurtosis, of the de- 
gree to which there is a bunching of cases at the modal 
value. The present chapter deals with various measures of 
variation and skewness. The method of measuring kurtosis 
is referred to at a later point. 

Nature and Significance of Variation 

The fact of variation in collections of quantitative data 
has been pointed out in earlier sections and the bearing of 
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t.Viig fact upon the work of the statistician indicated. Prac- 
tically every collection of quantitative data, consisting of 
measurements from the social, biological, or economic field, 
is characterized by variation, by quantitative differences 
among the individual units. And this fact of variation is 
as important as the fact of family resemblance. Biological 
variation has been a fundamental factor in the evolutionary 
process. No measurement of a physical characteristic of a 
racial group, such as height, is complete without an ac- 
companying measure of the average variation in the group 
in this respect. The average income in a country is perhaps 
of less significance than the variation in income, the differ- 
ences between the incomes received by different economic 
classes. Price variations interrupt the normal functioning 
of the economic system, causing hardship to some and 
giving unearned profits to others, because the various ele- 
ments in the price system are unequally affected. Not 
changes in the general level of prices but differences among 
changes in the prices of individual commodities and services 
cause trouble. 

An average, by itself, has little significance unless the 
degree of variation in the given frequency distribution is 
known. If the variation is so great that there is no pro- 
nounced central tendency an average has no significance. 
With a decrease in the degree of variation an average 
becomes increasingly significant. Whether a single fre- 
quency distribution is being described, therefore, or com- 
parison is being made with other distributions, a measure of 
central tendency must be supplemented by a measure of 
variation. 


Measures of Absolute Variation 

Variation may be expressed in terms of the units of 
measurement employed for the original data, or may be 
expressed as an abstract figure, such as a percentage, which 
is independent of the original units. When the original 
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units are employed absolute variability is measured; when 
an abstract figure is secured we have a measure of relative 
variability, more suitable for comparison than the former 
type. Measures of absolute variability are first considered. 

THE RANGE 

A rough measure of variation is afforded by the range, 
which is the absolute difference between the value of the 
smallest item and the value of the greatest item included 
in the distribution. Table 20 in Chapter IV shows the dis- 
tribution of London-New York monthly exchange rates 
during the period 1882-1913. The smallest item among the 
original figures included in the table is $4.83; the greatest 
is $4 . 908. The range, therefore, is $4 . 908-S4 . 83, or $ . 078. 
A distance on the scale equal to $.078 will include every 
item. If the original data were not to be had the range 
could be approximated from the frequency table. It would 
be the difference between the lower limit of the class at the 
lower extreme of the distribution, and the upper limit of 
the class at the upper extreme, or $.085 in the present 
case. 

The value of the range, it is obvious, depends upon the 
values of the two extreme cases only. A single abnormal 
item would change its value materially. Because it is 
erratic and is likely to be unrepresentative of the true 
distribution of items, it is seldom used in statistical work. 
The range is frequently employed as a measure of stock 
market fluctuations, though its adequacy for this purpose 
may be questioned. 

THE MEAN DEVIATION 

A more accurate measure of the dispersion of items about 
a central value is afforded by the simple device of measuring 
the deviation of each item from this central value and aver- 
aging these deviations. The simple example in Table 31 
illustrates the method of < omputation: 
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t* 

& 

6 

9 

12 

15 


Table 31 


Compulation of Mean Deviation 



M = 9 

M.D. = ^ - 3.6 
5 


The average (the mean and median coincide in this case) 
is 9. The deviations are added, taking no account of alge- 
braic signs, and the total divided by the number of items. 
This procedure is described by the expression 


In general terms, the mean deviation of a series of mag- 
nitudes is the arithmetic mean of their deviations from an 
average value (either mean or median). In the process of 
summation and averaging the algebraic signs of the devi- 
ations are disregarded. In practice it makes little differ- 
ence whether deviations be measured from the mean or 
the median. Theoretically the latter should be chosen, for 
the value of the mean deviation is least when the median 
is the point of reference. 

Table 32 illustrates the computation of the mean devi- 
ation when the data are grouped in a frequency distribu- 
tion. 1 In this work, as in certain other computations, we 
make the assumption that the items in each class-interval 
are uniformly distributed throughout that interval. 

The median hourly wage of the 4,216 steel workers repre- 
sented in this distribution is 48. 11 cents. The mean devia- 


1 Since the uses of the mean deviation are somewhat limited, the beginning 
student may well omit the remainder of the section on the mean deviation. 
After study of the more widely employed standard deviation the student may 
wish to return to the computation of the mean deviation of observations 
grouped in a frequency distribution. 



Table 32. Computation of Mean Deviation 
Average Hourly Earnings of Workers in OpenrHearth Furnaces 
in the Great Lakes and Middle West District in 1933 


Class-interval 
(in cent 8 per 
hour) 

Mid- 

point 

Fre- 

quency 

Deviation 

from 

arbitrary 



origin 


m 

f 

d f 

25.0- 29.9 

27.5 

12 

20 

30.0- 34.9 

32.5 

472 

15 

35.0- 39.9 

37.5 

700 

10 

40.0- 44.9 

42.5 

601 

5 

45.0- 49.9 

47.5 

520 

0 

50.0- 54.9 

52.5 

537 

5 

55.0- 59.9 

57.5 

397 

10 

60.0- 64.9 

62.5 

225 

15 

65.0- 69.9 

67.5 

139 

20 

70.0- 74.9 

72.5 

111 

25 

75.0- 79.9 

77.5 

43 

30 

80.0- 84.9 

82.5 

111 

35 

85.0- 89.9 

87.5 

74 

40 

90.0- 94.9 

92.5 

59 

45 

95.0- 99.9 

97.5 

45 

50 

100.0-104.9 

102.5 

51 

55 

105.0-109.9 

107.5 

78 

60 

110.0-114.9 

112.5 

6 

65 

115.0-119.9 

117.5 

17 

70 

120.0-124.9 

122.5 

1 

75 

125.0-129.9 

127.5 

2 

80 

130.0-134.9 

132.5 

5 

85 

135.0-139.9 

137.5 

7 

90 

140.0-144.9 

142.5 

1 

95 

145.0-149.9 

147.5 

1 

100 

150.0-154.9 

152.5 

0 

105 

155.0-159.9 

157.5 

1 

110 



4,216 



Computation of median: 

N 

2 =2,108 

^“ 45 0 + {Ii x < 50 )} 

= 45.0 + 3.11 
= 48.11 


fd' 

240 c = 0.61 

7,080 (Median = 48.11 
7,000 Arbitrary origin =47.5 
3,005 c = 48.11 - 47.5 = 0.61) 

0 

2,685 Na - No. of observations in 
3,970 classes above that 

3,375 containing the median 

2,780 - 1911 

2,775 

1,290 Nb = No. of observations in 
3,885 classes below that 

2,960 containing the median 

2,655 - 1785 

2,250 

2,805 Nm = No. of observations in 
4,680 the class-interval con- 

390 taining the median 

1,190 = 520 

75 

160 i — 5 
425 

630 Calculations 

95 

100 (1) Sum of deviations from 
0 arbitrary origin of all 

110 observations in classes 

56 010 other than that contain- 

ing the median = 56,610 


(2) (AT* - Na)c = - 76.86 



Sum of deviations from median = 56,610 — 76.86 

+ 688.67 
- 57,221.81 


M.D. - 


57,221.81 

4,216 


- 13.573 
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tion could be computed directly, with reference to deviations 
from the median, but it is simpler to measure the deviations 
from the midpoint of the class containing the median, and 
then apply corrections to offset the resulting error. 

In Table 32 deviations have been measured not from 
48.11, the value of the median, but from 47 . 5, the midpoint 
of the class in which the median falls. Working with these 
measurements, the computations involve three steps: 

1. Obtaining the sum of the deviations from the assumed 
n^rlinn of all items falling in classes other than that con- 
taining the true median. 

2. Correcting this sum for the error involved in the use of 
an origin other than the true median. 

3. Adding to the corrected sum the sum of the deviations 
from the median of the items within the class-interval con- 
taining the median. 

(1) The sum referred to in (1) is obtained directly, in the man- 
ner indicated in Table 32. 1 It comes to 56,610. 

(2) The four classes below that containing the median con- 
tain 1,785 items. The deviation of each of these items from the 
true median, 48. 11, is greater by 0-61 than the deviations actu- 
ally recorded in Table 32, which are measured from 47 . 5. The 
measured deviations are too small by 0.61 for 1,785 items. The 
22 classes above that containing the median contain 1,911 items. 
For each of these the deviation from the true mean, 48. 1, is less 
by 0.61 than the deviations actually recorded, which are meas- 
ured from 47-5. The measured deviations are too large by 
0.61 for 1,911 items. Accordingly the figure 56,610, which we 
have obtained as the sum of the deviations from the arbitrary 
origin of all the items in classes other than that containing the 
true median, must be corrected by the addition to it, algebra- 

t ically, of + (1,785 X 0-61) and - (1,911 X 0.61). 

1 This is not the sum of the deviations from 4 . 75, the arbitrary origin. For 
no account is taken of the deviations from that value of the 520 items falling 
within the class in question. If these are scattered uniformly throughout the 
class-interval they will contribute to the total of the deviations from 4.75. 
This would not be so if we were working on the assumption that all the items 
in a class are concentrated at the midpoint. In computing the mean devia- 
tion, however, it is necessary to make a different assumption, namely, that of 
uniform distribution throughout the class-interval. 
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The corrections under point (2) may be defined more briefly. 1 * 
Let N a = number of items in classes above that containing the 
median, Nb = number of items in classes below that containing 
the median, and c = Md — 0, where Md is the median and 
0 is the arbitrary origin. The quantity c will, of course, be 
positive or negative, depending on the relative values of Md 
and 0, and this sign should be retained throughout the calcu- 
lations. The correction noted in (2) is then given by 

(N b - N a )c 

which is to be added (algebraically) to the sum referred to in 
(1). In the present instance we have, as the required correction, 

(1,785 - 1,911) X ( + 0.61) = - 76.86. 

(3) Taking account of point (3) now, we must measure the 
deviations from the median of the 520 observations hitherto 
neglected. These are the observations falling within the class- 
interval that contains the median. This class-interval extends, 
on the x-scale, from 45.0 to 50.0. The value of the median is 
48.11. if the 520 observations are uniformly distributed be- 
tween 45 . 0 and 50 . 0, the number falling between 45 . 0 and 48.11 
may be computed by the direct proportion 

X 520 - 323.4. 

O. U 

Similarly, for the number of observations between 48.11 and 
50. 0, we have 

1.89 

-r-TT X 520 = 196.6. 
o.U 

On the assumption of uniform distribution, the average deviation 
from the median of the 323.4 observations falling between 45.0 

and 48.11 is 1.555 (“•• 3 -f) For the sum of the deviations 

of this group from the median, we have 

323.4 X 1.555 = 502.887. 

Similarly, the average deviation from the median of the 196.6 

( 1 8 

i.e., -~ 

1 Cf. A Handbook of Mathematical Statistics , H. L. Rietz, editor, Boston, 

Houghton Mifflin, 1924, 30. 
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For the sum of the deviations of this group from the median 
we have 

(196.6) X .945 = 185.787. 

The sum of the deviations from the median of all the observa- 
tions in the class containing the median is 

502.887 + 185.787 = 688.674. 

In more general terms, the correction noted in (3) may be 
defined as follows. We have c = Md — 0; let i = class-inter- 
val and let N m = number of observations in the class-interval 
in which the median lies. The sign of c must be retained in the 
calculations. For the number of items in that portion of this 
class-interval which falls below the median, we have 



The average deviation of these items from the median is 



2 


The sum of the deviations from the median of the items in this 
segment of the class-interval containing the median is the 
product of these two quantities, or 



For the number of items in that portion of this class-interval 
which lies above the median, we have 



The average deviation of these items from the median is 


i 



2 
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The sum of the deviations from the median of the items in this 
segment of the class-interval containing the median is given by 



Accordingly, the total correction referred to under (3), on p. 142, 
or the sum of the deviations from the median of the items within 
the class-interval containing the median, is 



The nature of these formulas may be made clearer by insertion 
of the values in the example cited above. 


In the final computation of the mean deviation we must 
apply to the sum referred to under (1), on p. 142, the two cor- 
rections noted under (2) and (3) on p. 142. From (1) we have 
56,610; the correction under (2) is — 76.86; the correction 
under (3) is + 688 . 67. The sum of the deviations from the 
median is, therefore, 57,221 . 81. For the mean deviation from 
the median, we have 


M.D. 


57,221.81 

4,216 


13.573. 


The mean deviation from the mean may be computed by 
an identical process. 


THE STANDARD DEVIATION 

The process of calculating the mean deviation is alge- 
braically illogical because algebraic signs are disregarded. 
In the computation of the standard deviation this error is 
avoided and a measure of more precise mathematical sig- 
nificance is secured. The conventional symbol for the 
standard deviation is the Greek letter sigma, <r. 

In computing this measure the deviations of the indi- 
vidual items from the arithmetic mean are squared, totaled, 
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the mean of the squared deviations obtained, and the square 
root of this mean extracted. The standard deviation is, 
thus, the square root of the mean of the squared deviations. 
This measure is also termed the root-mean-square deviation, 
a useful name because it describes in full the method of 
calculation. The deviations are always measured from the 
arithmetic mean, as the value of the measure is a minimum 
under these conditions. A simple example will illustrate the 
process (Table 33). 


Table 33 



Computation of Standard Deviation 

m 

1 

d d* 


3 

1 

- 6 36 


6 

1 

- 3 9 

M — 9 

9 

I 

0 0 

, /90 

12 

1 

+ 3 9 

a = / VJ 

15 

1 

+ 6 36 

V o 


5 

90 

= V18 




a = 4.24 


When the standard deviation is computed from ungrouped 
data, as in this case, the formula 1 is 



When the items are grouped in a frequency distribution 
the task of computation is a little more complicated. The 
measurement of deviations from an arbitrary origin is essen- 
tial in this case, as it greatly simplifies the calculations. 

1 This formula is used in statistical description , which is the concern of this 
section of the book. If our purpose is to use results secured from a sample as 
estimates of the attributes of the population from which the sample has been 
drawn, a slight modification is desirable. It has been shown that the estimate 
of the true standard deviation is improved if iV— 1 be used as the divisor in the 
formula, in place of N. The difference is slight for estimates based on large 
samples, important for small ones. 
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The general formula for the standard deviation is 



where / represents the class-frequencies, d the deviations 
from the arithmetic mean and N the number of cases 
included. It follows that 


If a deviation from an arbitrary origin be represented by 
d' and the root-mean-square deviation from this origin be 
represented by s a , we have 

. . ssi* 

& ~ N ' 

The root-mean-square deviation from the mean (<r) is less 
than the root-mean-square deviation from any other point 
on the scale. Hence s„ 2 is greater than <r 2 . We may repre- 
sent by c the difference between the true mean and the 
arbitrary origin. It may be readily established 1 that 


ff 2 = So 2 — c 2 . 

The value of the standard deviation may be most easily 
determined, therefore, by computing s 0 2 and c 2 . The opera- 
tions involved are illustrated in detail in Table 34, showing 
the distribution of 11,404 steel workers, classified on the 
basis of average hourly earnings in 1933. 


. „ . 2d 1 

1 For v* — -y- 

but 2d°0 

, a .-W 

,a N 

.\2(d')* - Xd* + Nc * 

d' - d + c 

2<d')* _ 2d* 

N N + 

(d!Y - d} + 2 cd + c* 

Sa 2 - <r 2 + C 2 

2(d')» * 2d 2 + 2 cXd + Nc 2 

<r 2 •= 3* 2 — c 2 r 



Table 34. Computation of Standard Deviation 
Average Hourly Earnings of Workers in Open-Hearth Furnaces 

in 1933 


( 1 ) 

( 2 ) 

(3) 

(4) 


(5) 

( 6 ) 

(7) 

( 8 ) 

Class- 

Mid- 

point 

(cents) 


Deviation 






interval 

Fre- 

from 






( cents per 
hour) 

quency 

arbitrary 

origin 







m 

f 

d' 


fd’ 

m 2 

W'+ 1) 2 

f(d'+l) 2 

15.0- 19.9 

17.5 

41 

- 9 

— 

369 

3,321 

64 

2,624 

20.0- 24.9 

22.5 

54 

- 8 

— 

432 

3,456 

49 

2,646 

25.0- 29.9 

27.5 

342 

- 7 

— 

2,394 

16,758 

36 

12,312 

30.0- 34.9 

32.5 

1,158 

- 6 

— 

6,948 

41,688 

25 

28,950 

35.0- 39.9 

37.5 

2,103 

- 5 

— 

10,515 

52,575 

16 

33,648 

40.0- 44.9 

42.5 

2,063 

- 4 

— 

8,252 

33,008 

9 

18,567 

45.0- 49.9 

47.5 

1,433 

- 3 

— 

4,299 

12,897 

4 

5,732 

50.0- 54.9 

52.5 

1,131 

- 2 

— 

2,262 

4,524 

1 

1,131 

55.0- 59.9 

57.5 

775 

- 1 

— 

775 

775 

0 

0 

60.0- 64.9 

62.5 

478 

0 


0 

0 

1 

478 

65.0- 69.9 

67.5 

457 

1 


457 

457 

4 

1,828 

70.0- 74.9 

72.5 

304 

2 


608 

1,216 

9 

2,736 

75.0- 79.9 

77.5 

216 

3 


648 

1,944 

16 

3,456 

80.0- 84.9 

82.5 

193 

4 


772 

3,088 

25 

4,825 

85.0- 89.9 

87.5 

117 

5 


585 

2,925 

36 

4,212 

90.0- 94.9 

92.5 

111 

6 


666 

3,996 

49 

5,439 

95.0- 99.9 

97.5 

62 - 

7 


434 

3,038 

64 

3,968 

100.0-104.9 

102.5 

71 

8 


568 

4,544 

81 

5,751 

10,300 

105.0-109.9 

107.5 

103 

9 


927 

8,343 

100 

110.0-114.9 

112.5 

34 

10 


340 

3,400 

121 

4,114 

115.0-119.9 

117.5 

58 

11 


638 

7,018 

144 

8,352 

120.0-124.9 

122.5 

27 

12 


324 

3,888 

169 

4,563 

125.0-129.9 

127.5 

19 

13 


247 

3,211 

196 

3,724 

130.0-134.9 

132.5 

19 

14 


266 

3,724 

225 

4,275 

135.0-139.9 

137.5 

14 

15 


210 

3,150 

256 

3,584 

140.0-144.9 

142.5 

12 

16 


192 

3,072 

289 

3,468 

145.0-149.9 

147.5 

2 

17 


34 

578 

324 

648 

150.0-154.9 

152.5 

4 

18 


72 

1,296 

361 

1,444 

155.0-159.9 

157.5 

2 

19 


38 

722 

400 

800 

160.0-164.9 

162.5 

1 

20 


20 

400 

441 

441 

Total 


11,404 


17 

28,200 229,012 


184,016 

N - 11,404 








Class-interval = 

5.0 cents 






c (in class-interval units) 

- 28,200 _ 
11,404 

2.4728 



c 2 (in class-interval units) = + 6.1147 





2/(<n 2 229,012 

8a 2 (in class-interval units) = — 77 — = = 20.0817 

N 11,404 


<r 2 (in class-interval units) = s 0 2 — c 2 ■* 20.0817 — 6.1147 = 13.9670 
a (in class-interval units) — 3.737 

a (in original units) — 3 . 737 X 5 . 0 cents =* 18. 685 cents. 
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The entire calculation, it will be noted, is carried through 
in terms of class-interval units, the result being reduced to 
the original units in the final operation. In computing c, 
the difference between the true mean and the arbitrary 
origin, the algebraic sum of the deviations is divided by the 
number of cases. The arithmetic mean could be deter- 
mined by reducing c to original units and adding this value 
(algebraically) to the value of the arbitrary quantity se- 
lected as origin, but this is not an essential step. The actual 
value of the mean need not be known in the computation of 
the standard deviation. 

A check upon the accuracy of the calculations (the Charlier 
check x ) is afforded by the figures in cols. (7) and (8) of 
Table 34. If deviations be measured, not from the arbitrary 
origin employed in computing the standard deviation, but 
from an origin one class-interval below, we secure a set of 
values equal to d' + 1. The squares of these values are 
given in col. (7). Multiplying by the corresponding fre- 
quencies we have the quantities recorded in col. (8), the 
sum of which is 184,016. This total stands in a definite 
relationship to the values secured in computing the standard 
deviation. For 

W + l) 2 = S/[(d') 2 + 2 d' + 1] 

= 2 f(d'Y + 22 fd! + 2/ 
or 2/(d' + l) 2 = 2 f(d'y + 22/d' + N. 

Inserting in this last equation the values secured from 
the calculations shown in Table 34, we obtain this check: 

184,016 = 229,012 + 2(- 28,200) + 11,404 
= 184,016. 

The following is a summary of the steps in the process of 
computing the standard deviation of items grouped in a 
frequency distribution: 

1 Cl. C. V. L. Charlier, Vtyrlesungen Vber Die GrundzUge Der Maihematuchen 
Statistik, Lund, Verlag Scientia, 1920, 19. 
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1. Select as arbitrary origin the mid-point of a class near the center 

of the distribution. 

2. Measure the deviations from this point of the items in each class, 

in class-interval units. Multiply the deviations by the corre- 
sponding class-frequencies. 

3. Divide the algebraic sum of the deviations by N. This gives c, 

in class-interval units. Compute c s . 

4. Square the deviations and multiply by the corresponding class- 

frequencies. 

5. Divide the sum of the squared deviations by N. This gives s«*, 

in class-interval units. 

6. From the formula, <r 2 = s„* — c 1 , compute a 1 . Extract the 

square root of this value, securing a in class-interval units. 

7. Multiply <r, as thus computed, by the class-interval. The result 

is <r in the original units of measurement. 

Certain of the characteristics of the standard deviation 
and its relation to other measures of dispersion are described 
in a later section 1 

THE QUARTILU DEVIATION 

In the chapter on averages methods of locating the quar- 
tiles and deciles were described. The former are those points 
on the scale of values, along which the items of a given 
distribution lie, which divide the total number of items into 
four equal groups. The deciles are those points dividing the 
total number of items into ten equal groups. The degree 
and character of the variation in a frequency distribution 
may be accurately described if the location of the quartiles 
and deciles is shown. Such knowledge, however, while 
helpful in giving a picture of the distribution, is not as use- 
ful for purposes of concise description and comparison as 
knowledge of the values of the mean deviation or the stand- 
ard deviation. The significance of a single measure is more 
readily grasped than is the meaning of a number of inter- 
related values. Such a measure of variation may be com- 
puted from the quartiles, however. With regard to ease of 

1 A correction to be applied to the standard deviation in certain cases 
(Sheppard’s correction) is described in Chapter XIII. 
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calculation and immediate significance this quartile, deviation 
has distinct merits. 

Within the range between the two quartiles, of course, 
one half of all the measures are included. The greater the 
concentration the smaller this interval, hence a fairly accu- 
rate measure of dispersion may be obtained from the rela- 
tionship between these two quartiles. The quartile deviation 
is the semi-interquartile range, half the distance along the 
scale between the first and third quartiles. Thus if Q.D. 
represent the quartile deviation, Qi the first quartile and 
Q» the third quartile, 

Q.D. = ~ Ql - 

A 

If the value of a point on the scale half-way between 
the first and third quartiles is represented by K, one half 
of all the measures in a frequency distribution will fall 
within the range K ± Q.D. For the data in Table 32, 
relating to the hourly earnings in 1933 of steel workers in 
the Great Lakes and Middle West District, we have 

Q, = 39.07 

Q 3 = 59.03 

Q£> _ 59.03 - 39.07 
2 

= 9.98 

K = 39.07 + 9.98 
= 49.05. 

Thus one half of all the measures lie within the range 
49.05 =fc 9.98. This statement, together with a statement 
of the average hourly earnings in 1933 (mean, median, or 
mode), constitutes a useful description of the distribution. 
In a perfectly symmetrical distribution the value of K will 
coincide with the value of the median (that is, the median 
will lie half-way along the scale from Qi to Q»). The dis- 
tribution of wage rates is slightly asymmetrical, the value of 
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the median being 48.11, as compared with the value of 
49.05 for K. 


THE PROBABLE ERROR 

In studying the results of astronomical and other physi- 
cal measurements it has been found that the values secured 
by different observers for the same constant quantity vary. 
These varying results, however, are distributed in a certain 
definite way, and when plotted give a curve similar to the 
normal curve of error. In such cases there is an immediate 
and obvious need of some measure of variation which may 
be used as an index of the reliability of given results. If 
the results secured by different investigators, or by the 
same investigator at different times, vary widely they can- 
not be accepted as reliable, while the reverse is true if the 
variation is slight. The measure of dispersion which has 
been generally employed in such cases is termed the prob- 
able error. The probable error is that amount which, in a 
given case, is exceeded by the errors of one half the ob- 
servations. Since the most probable value of a given series 
of observations is their arithmetic mean, the probable error 
is always measured from the mean. The name of this 
measure derives from the fact that the probability that a 
given observation will vary from the mean of all the ob- 
servations by an amount greater than the probable error 
is exactly J. It follows that, when the observations are 
arranged in the form of a frequency distribution, a distance 
equal to the probable error laid off on each side of the arith- 
metic mean will define limits within which one half of the 
total number of cases will fall. 

This measure of variation has been employed in fields 
other than that in which it was originally applied, fields in 
which the name probable error is somewhat misleading. In 
such cases it is perhaps better to think of it as the probable 
deviation, that distance from the mean which will be ex- 
ceeded by one half of the total deviations. 
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The probable error is a measure of dispersion which is 
fully significant only when it applies to a distribution fol- 
lowing the normal law of error. In such cases it has a 
definite and precise meaning. This is not so when it is 
applied to skew distributions, and its use in such cases 
is not advisable. The quartile deviation, the value of which 
is equal to that of the probable error in a normal distribu- 
tion, has a more direct significance than the probable error 
in the description of abnormal distributions, and should be 
employed in such cases. In a later section the use of the 
probable error as a measure of the reliability of statistical 
results is more fully explained. 

The value of the probable error in a given case, assuming 
a normal distribution to prevail, may be determined from 
the value of the standard deviation, for there is a constant 
relationship between these two. This is expressed by the 
formula: P.E. = 0.6745 u. 

Relations between Different Measures of 
Variation 

An understanding of the significance of the various meas- 
ures of dispersion described above may be facilitated by a 
general comparison and a summary statement of the rela- 
tions holding between them. 

1. The range is a distance along the scale within which all the 

observations lie. 

2. The quartile deviation or semi-interquartile range is a distance 

along the scale which, when laid off on each side of the point 
mi dway between the two quartiles, includes one half the total 
number of observations. 

3. The mean deviation from the mean, in a normal or slightly 

skew distribution, is equal to about £ of the standard devia- 
tion. A range of 7£ times the mean deviation, centering at the 
mean, will include approximately 99 per cent of all the cases. 

4. When a distance equal to the standard deviation is laid off on 

each side of the mean, in a normal or only slightly skew dis- 
tribution, about two thirds of all the cases will be included: 
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(In the normal distribution about 68.26 per cent of the obser- 
vations will be included.) When a distance equal to twice the 
standard deviation is laid off on each side of the mean approxi- 
mately 95 per cent of the cases will be included (about 95.46 
per cent in a normal distribution). When a distance equal to 
three times the standard deviation is laid off on each side of 
the mean about 99 per cent of all the observations will be 
included (about 99.73 per cent in a normal distribution). 
This general rule that a range of six times the standard devia- 
tion, centering at the mean, will include about 99 per cent of 
all the measures furnishes a useful check upon calculations. 

A study of Fig. 45 may help to make clear the significance of 
the standard deviation in a normal distribution. 

5. The probable error , in a normal distribution, is equal to 0.6745(r. 
A range of twice the probable error, centering at the mean, 
will include 50 per cent of all the observations. A range of 
eight times the probable error, centering at the mean, will 
include approximately 99 per cent of all the observations. 

Characteristic Features of the Chief Measures 
of Variation 

The range 

1. The range is easily calculated and its significance is readily 

understood. As a rough measure of the degree of variation 
the range is useful. 

2. The value of the range is determined by the values of the two 

extreme cases. It is thus a highly unstable measure, the 
value of which may be greatly changed by the addition or 
withdrawal of a single figure. 

3. This measure gives no indication of the character of the distri- 

bution within the two extreme observations. 

The quartile deviation 

1. The quartile deviation is a measure of dispersion that is easily 

computed and readily understood. It is superior to the range 
as a rough measure of variation. 

2. The quartile deviation is not a measure of the variation from 

any specific average. 

3. This measure is not affected by the distribution of the items 

between the first and third quartiles, or by the distribution 
outside the quartiles. The values of the quartile deviation 
might be the same for two quite dissimilar distributions, pro- 
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vided the quartiles happened to coincide. Because it is not 
affected by the deviations of individual items it cannot be 
accepted as an accurate measure of variation. 

4. The quartile deviation is not suited to algebraic treatment. 

The mean deviation 

1. The mean deviation is affected by the value of every observa- 

tion. As the average difference between the individual items 
and the median (or mean) of the distribution it has a precise 
significance. 

2. The mean deviation is less affected by extreme deviations than 

the standard deviation. 

3. Mathematically, the mean deviation is not as logical or as con- 

venient a measure of dispersion as the standard deviation. 

The standard deviation 

1. The standard deviation is affected by the value of every ob- 

servation. 

2. The process of squaring the deviations before adding avoids the 

algebraic fallacy of disregarding signs. 

3. The standard deviation has a definite mathematical meaning 

and is perfectly adapted to algebraic treatment. 

4. The standard deviation is, in general, less affected by fluctua- 

tions of sampling than the other measures of dispersion. 

5. The normal curve of error has been analyzed in terms of the 

standard deviation. The information thus obtained has 
increased greatly the utility of the standard deviation. 

The probable error 

1. The probable error has a definite meaning in the ease of a dis- 

tribution following the normal law. It has not this precise 
meaning for other distributions, and should not be employed 
in describing them. 

2. For distributions to which it is adapted, the probable error is an 

extremely useful measure. Its most important use is as an 
index of the magnitude of errors of sampling. 

3. The definite relationship between the probable error and the 

standard deviation, for a normal distribution, permits the 
value of the probable error to be readily determined. 

All the measures of variation described above may be 
utilized for particular purposes. The standard deviation, 
however, is the best general measure and should be em- 
ployed in all cases where a high degree of accuracy is re- 
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quired. The probable error is, in effect, merely a fractional 
part of the standard deviation, with a definite but restricted 
field of usefulness. 

The Measurement of Relative Variation 

We have been dealing in the preceding section with 
absolute variability. The various measures of dispersion 
secured by the methods outlined describe the variability 
of the data in terms of absolute units of measurement. 
The standard deviation of London-Paris exchange rates is 
in francs, the standard deviation of pig iron production in 
tons, etc. If the object in a given case is the description of 
a single frequency distribution it is desirable that the orig- 
inal unit be employed throughout, but if measures of varia- 
tion of two different distributions are to be compared, diffi- 
culties are encountered. This is clear if the units are unlike, 
but even if the units are identical the same difficulty arises. 
Thus measures of variation in the weights of dogs and in the 
weights of horses might both have been computed in pounds. 
Because the standard deviation of horse weights is greater 
than the standard deviation of dog weights, it does not fol- 
low that the degree of variability is greater in the former 
case. A measure of absolute variation is significant only in 
relation to the average from which the deviations are meas- 
ured. Its use, apart from this average, is meaningless. For 
comparison, therefore, it must be reduced to a relative form, 
and the obvious procedure is to express a given measure 
of variation as a percentage of the average from which the 
deviations have been measured. The quantity thus becomes 
an abstract number, a measure of the relative variability 
of the given observations, and may be compared with similar 
terms computed from other distributions. 

THE COEFFICIENT OF VARIATION 

The measure of relative variation most commonly em- 
ployed is that developed by Pearson, termed the coefficient 
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of variation, and represented by the letter V. It is simply 
the standard deviation as a percentage of the arithmetic 
mean. Thus 

V = £ X 100. 

M 


Applying this formula to the results secured from the an- 
alysis of the distribution of steel workers, classified accord- 
ing to hourly earnings in 1933 (Table 34), we have 


V = 


18.685 

50.136 


X 100 


= 37.27%. 


This measurement may be compared with a similar coeffi- 
cient relating to the distribution of workers in open-hearth 
furnaces, classified according to average hourly earnings in 
1935. In that year the mean wage was 71.946 cents and the 
standard deviation 28.55 cents. From these 


28.55 

71.946 


X 100 


= 39.68%. 


Variations of hourly earnings among steel workers was 
greater in 1935 than in 1933. The difference was not as 
great, however, as a comparison of standard deviations would 
indicate. The average wage advanced appreciably between 
1933 and 1935 and the relative variation increased only 
moderately. 

An index of variability similar to this coefficient might 
be secured by expressing any of the other measures of 
deviation as a percentage of the average from which the 
deviations were computed. Pearson’s coefficient has been 
generally adopted, however, and is the only one in wide use. 


Measures of Skewness 

Methods have been developed in the preceding sections 
for describing the central tendency of a frequency distri- 
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bution and for measuring the degree of concentration or 
lack of concentration about that central tendency. One 
further measure is needed, and that is one which indicates 
the degree of skewness or asymmetry of a given distri- 
bution. For it is essential to know, in regard to a given 
distribution, whether the observations are arranged sym- 
metrically about the central value, or are dispersed in an 
uneven, asymmetrical fashion about that value. Having 
such a figure it will be possible effectively to summarize 
the characteristics of a frequency distribution in three sim- 
ple terms — an average, a measure of dispersion and a 
measure of skewness. There are two measures of skewness 
in current use. 

If a frequency curve is perfectly symmetrical, mean, 
median, and mode will coincide. As the distribution de- 
parts from symmetry these three values are pulled apart, 
the difference between the mean and the mode being great- 
est. This difference may be used, therefore, as a measure 
of skewness. It is desirable in this case, as in measuring 
relative variability, to secure an index in the form of an 
abstract number, which may be compared with similar fig- 
ures derived from other distributions. To this end, Pearson 
has proposed dividing the absolute difference between mean 
and mode by the standard deviation of the given distribu- 
tion. His formula is 


sk (skewness) = 



In a symmetrical distribution, where mean and mode coin- 
cide, the value of this measure will be zero. Under other 
conditions the value may be positive or negative, depending 
upon the relative positions of the two averages on the scale. 

For moderately skew distributions the degree of skew- 
ness may be computed more readily from the formula 

<r 
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This corresponds approximately to the other formula, be- 
cause of the fact that in a moderately asymmetrical distri- 
bution the median lies between the mean and the mode, 
about one third of the distance from the former towards 
the latter. 

Because it is difficult to locate the mode by simple meth- 
ods, a measure of skewness more easily computed than 
Pearson’s is desirable in some cases. Bowley has proposed 
such a method, based upon the relationship between the 
first and third quartiles and the median. If the distribution 
is symmetrical these two quartiles will be equidistant from 
the median; with an asymmetrical distribution this is not 
so. Therefore, if we let q t represent the difference between 
the upper quartile and the median and qi represent the 
difference between the median and the lower quartile, we 
may use the formula 

sk = 9lHlL. 

qt + qi 

as a means of securing a measure of skewness. This value 
will vary between 0 and ± 1. For with perfect symmetry 
qt = q\, and the measure is 0; with asymmetry so pro- 
nounced that the median and one of the quartiles coincide, 
either q 2 or qi becomes equal to 0, and the formula gives 
a value of + 1 or — 1. Bowley suggests that a value of 0.1 
indicates a moderate degree of skewness, while a value of 
0.3 indicates marked skewness. 

The values secured from this measure are not, of course, 
comparable with the values secured from the application 
of Pearson’s formula for measuring skewness. 

KTJRT08IS 

Reference has been made to a fourth measurable char- 
acteristic of frequency curves. This is the degree of flat- 
toppedness, as compared with the normal curve. A measure 
of kurtosis, the technical term for this characteristic, is 
given in Chapter XIII. 
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CHAPTER VI 


INDEX NUMBERS OF PRICES 
The Nature of Index Numbers 

The term “index number” has been applied to a number 
of somewhat similar devices employed in che analysis of 
statistical series. Index numbers have been most widely 
used in the study of price changes, but a brief considera- 
tion of certain other uses may make clear the essential 
characteristics of such measures. In its simplest form this 
name is applied to a term in a time series expressed as a 
relative number. Thus an index number of cotton consump- 
tion in the United States might take the following form: 

Table 35 


Domestic Cotton Consumption in the United States, 
1926-1936 

(Consumption in year ended July 31, 1926 = 100) 


Year ended 
July 31 

Cotton consumption 
(unit: one thousand 
running bales) 

Cotton consumption 

relative 

1926 

6,456 

100 

1927 

7,190 

111 

1928 

6,834 

106 

1929 

7,091 

110 

1930 

6,106 

95 

1931 

5,263 

82 

1932 

4,866 

75 

1933 

6,137 

95 

1934 

5,700 

88 

1935 

5,361 

83 

1936 

6,351 

98 


Similarly the price of a commodity may be expressed as 
a relative, the price at a given date or for a given period 
serving as base. 
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Table 36 

Average Price of No. 1 Northern Spring Wheat, Minneapoli* 
1913, 1929-1936 

(Average price in year ended June 30, 1913 «* 100) 


Crop 

Weighted average 

Relative 

year 

price per bushel 

price 

1913 

$0,874 

100 

1929 

1.276 

146 

1930 

0.984 

113 

1931 

0.739 

85 

1932 

0.605 

69 

1933 

0.770 

88 

1934 

1.026 

117 

1935 

1.165 

133 

1936 

1.247 

143 


The representation of the terms in a time series as rela- 
tives, with reference to a fixed base, makes possible a ready 
comparison of the values for different dates and enables one 
to follow the trend of the series much more easily than 
when the data are presented in their original form. Compari- 
son of the trends of different series is also facilitated. 

Though the term index number has been applied to such 
relatives it is better practice to reserve the term for figures 
which represent the combination of a number of series. 
The series to be combined may relate to prices, production, 
consumption, wages, volume of trade, or to any factor sub- 
ject to temporal variation. (Index numbers have been used 
also in measuring such geographical differences as arise 
from variations in living costs from city to city or from 
country to country.) Quite complex problems may be in- 
volved in the construction of any one of these special forms 
of index numbers, but the essential aim in all cases is to 
secure a single, simple series that will define the net resultants 
of the changes occurring in the constituent elements. 

A simple index number may be constructed to represent 
the course of coal and petroleum production in the United 
States. In the making of such an index it is necessary to 
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combine in some way production figures for bituminous 
and anthracite coal and petroleum. The production figures 
and the corresponding relatives for the three series, from 
1922 to 1936, are given in Table 37. 


Table 37 


Production of Bituminous and Anthracite Coal and Petroleum 
in the United States, 1922-1936 
(Production in 1922 = 100) 



Prod . of 


Prod . of 


Prod, of 


Year 

bit . coal 
(: million 

Rel. 

anthr . coal 
{million 

Rel. 

petrol. 

(million 

Rel. 


sh. tons) 


sh. tons) 


bbls.) 


1922 

422.3 

100 

54.7 

100 

557.5 

100 

1923 

564.6 

134 

93.3 

171 

732 A 

131 

1924 

483.7 

115 

87.9 

161 

713.9 

128 

1925 

520.1 

123 

61.8 

113 

763.7 

137 

1926 

573.4 

136 

84.4 

154 

770.9 

138 

1927 

517.8 

123 

80.1 

146 

901.1 

162 

1928 

500.7 

119 

75.3 

138 

901.5 

162 

1929 

535.0 

127 

73.8 

135 

1,007.3 

181 

1930 

467.5 

111 

69.4 

127 

898.0 

161 

1931 

382.1 

90 

59.6 

109 

851.1 

153 

1932 

309.7 

73 

49.9 

91 

785.2 

141 

1933 

333.6 

79 

49.5 

90 

905.7 

162 

1934 

359.4 

85 

57.2 

105 

908.1 

163 

1935 

372.4 

88 

52.2 

95 

996.6 

179 

1936 

434.1 

103 

54.8 

100 

1,098.5 

197 


A rough index of fuel production, based upon these three 
series, is desired. It is impossible, obviously, to add the 
original figures, as the units are not the same. This diffi- 
culty may be avoided by using the relative figures. A simple 
average of the three relatives for a given year may serve 
as the required index. Index numbers thus secured are 
given in Table 38 on page 164. 

In securing this index, by adding the three relative fig- 
ures for a given year and dividing by three, equal weight 
has been given to each of the three series. Such an index of 
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Table 38 

Index Numbers of Coal and Petroleum Production in the 
United States. 1922-1936 


(Production i: 


Year 

Index 

1922 

100 

1923 

145 

1924 

135 

1925 

124 

1926 

143 

1927 

144 

1928 

140 

1929 

148 


1922 - 100) 


Year 

Index 

1930 

133 

1931 

117 

1932 

102 

1933 

110 

1934 

118 

1935 

121 

1936 

133 


equally weighted relatives has been termed an unweighted 
index, but the term is misleading. Weights are used, the 
weights in this case being equal. It is clear that this index 
based upon equal weights does not reflect faithfully the 
three series combined in the present instance. For the three 
series are not of equal importance, as the system of equal 
weights assumes. The following figures showing the whole- 
sale values in exchange in 1926 of bituminous coal, anthra- 
cite coal, and crude petroleum indicate the relative im- 
portance of the three series: 1 


Mineral 

Bituminous coal 
Anthracite coal 
Petroleum 


Wholesale value in 
exchange in 1926 

$2,157,740,000 

888,141,000 

1,355,989,000 


Roughly, these stand to one another in the relation of 
5, 2, and 3, and these weights may be assigned to the series 
under consideration. An index for each year may be com- 
puted, using these weights. The example in Table 39, 
showing the calculations for the years 1922 and 1923, will 
illustrate the method. 


1 The figures have been compiled by the U. S. Bureau of Labor Statistics. 
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Table 39 


Computation of Weighted Index Numbers of Coal 
and Petroleum Production 


Mineral 

Relative 

production 

1922 

Wt. 

Wt . X Rel. 

Relative 

production 

1923 

Wt. 

Wt. X Rel. 

Bituminous coal 

100 

5 

500 

134 

5 

670 

Anthracite coal 

100 

2 

200 

171 

2 

342 

Petroleum 

100 

3 

300 

131 

3 

393 



10 

1,000 


10 

1,405 


Index of fuel production, 1922 = 1,000 4- 10 — 100 
Index of fuel production, 1923 = 1,405 - 4 - 10 = 141 


The value of the index thus secured for each of the fif- 
teen years covered is shown in Table 40. 


Table 40 

Weighted Index Numbers of Coal and Petroleum 


Production in the United States, 1922-1936 


Year 

Index 

Year 

Index 

1922 

100 

1930 

129 

1923 

141 

1931 

113 

1924 

128 

1932 

97 

1925 

125 

1933 

106 

1926 

140 

1934 

112 

1927 

139 

1935 

117 

1928 

136 

1936 

131 

1929 

145 




Differences between the two series of index numbers are 
to be expected. The second series, which is the more log- 
ically weighted, is, of course, the more accurate of the two, 
and gives a more faithful representation of the combined 
effect of the forces affecting the output of coal and petroleum. 

Another type of index number is one in which the items 
in the constituent series are totaled, the aggregate figure, 
instead of an average, serving as the representative of the 
entire group. Such a form of index number may be con- 







160 INDEX NUMBERS OF PRICES 


structed only when the different series are all expressed in 
the same unit. This form is frequently employed as an 
indication of changes in the level of prices, the aggregate 
cost of a bill of goods at one period being compared with 
the aggregate cost of the same goods at other dates. The 
figures in Table 41 illustrate this type of index. 


Table 41 


Bradstreet’s Index of Wholesale Prices in the 
United States, 1926-1937 1 


Year 

Index 

1926 

13.02 

1927 

12.78 

1928 

13.28 

1929 

12.67 

1930 

10.75 

1931 

8.76 


Year 

Index 

1932 

7.10 

1933 

7.86 

1934 

9.22 

1935 

9.92 

1936 

10.10 

1937 

11.06 


Each of the yearly aggregates quoted above is the sum 
of the average prices during the year of 96 commodities at 
wholesale. Before being added all the prices are reduced to 
the “per pound” basis, so that a certain degree of compara- 
bility is secured. Such an index may be readily changed to 
the relative form, any year being taken as a base and the 
totals for the other years expressed as percentages of the 
figure for the base year. 

The examples which have been given will indicate some 
of the many forms which index numbers may take. The 
term may refer to a simple relative number; it may be 
applied to an average of relative terms, or to an aggregate 
of relative or absolute figures. In all the examples given 
the index has been designed to serve as a measure of change 
over a period, as an indicator of changes in the values of 
time series. The term may have a much broader meaning 
than this. An index of the ability of salesmen might be 
constructed by giving numerical values to the factors deter- 

1 Construction of this index was discontinued at the end of 1937. 
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mining their usefulness and securing an average of these 
values. An index of the efficiency of different departments 
in a business enterprise might be constructed. In any case, 
the construction of an index involves the reduction to com- 
parable terms of a number of different factors and the 
replacement of these several terms by a single figure which 
may serve as their representative. Comparison is thus 
facilitated, whether it be comparison over time or space, or 
comparison with other indices secured by averaging terms 
relating to a similar unit. In all its forms (except the first 
limited and exceptional meaning in which it applies to a 
simple relative) an index number is thus a type of statistical 
average, and such numbers, in their construction and use, 
are subject to all the rules and limitations set forth in the 
development of the subject of averages. 

In the present work we are interested only in the applica- 
tion of the index number device to time series. So varied, 
however, are the rules and practices relating to its applica- 
tion to different types of time series that certain of these 
types must be treated separately. Our first concern is with 
index numbers of wholesale prices. 

Price Changes 

When price movements are surveyed in detail it is diffi- 
cult to perceive order, or any definite trend. We find a mul- 
tiplicity of conflicting movements. The price quotations 
in Table 42 (on page 168), taken at random, are roughly 
typical of what would be found were the entire field of prices 
canvassed in order to compare price movements from month 
to month. 

Of the sixteen commodities listed, five showed no price 
change at all between October and November, 1937, two 
showed price increases, and in nine cases prices declined. 
Some of the price movements were inconsiderable, while 
some marked very material changes. Such, as seen here 
in miniature, is what happens in the price system as a whole. 
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Table 42 


Commodity Prices at Wholesale 1 




Price 

Price 

Commodity 

Unit 

(wholesale) 

(wholesale) 



October, 

November, 



1937 

1937 

Brick, common building, aver- 




age of yard prices 

1,000 

$12,113 

$12. 113 

Pig iron, basic, Valley furnace 

Gross ton 

23.500 

23.500 

Cement, Portland, average of 




plant prices 

Bbl. 

1.667 

1.667 

Linseed oil, raw, N. Y. 

Pound 

.110 

.106 

Steel billets, rerolling, Pitts. 

Gross ton 

37.000 

37.000 

Steel, scrap, Chicago 

Gross ton 

14.688 

12.500 

Copper, electrol., refinery 

Pound 

.119 

.108 

Lead, pig, N. Y. 

Pound 

.058 

.051 

Zinc, pig, N. Y. 

Pound 

.065 

.060 

Coal, anthr., chestnut, average 




of 15 price series, on tracks, 




destination 

Net ton 

9.472 

9.610 

Coal, bit., mine run, average of 




27 price series, on tracks, des- 




tination 

Net ton 

4.305 

4.303 

Crude petroleum, Penn., at wells Bbl. 

2.413 

2.350 

Gasoline, motor, California, re- 




finery 

Gal. 

.083 

.085 

Cotton, middling, N. 0. 

Pound 

.083 

.080 

Wheat, no. 2 red winter, Chi. 

Bu. 

1.033 

.951 

Sugar, granulated, N. Y. 

Pound 

.048 

.048 


All prices do not, with absolute uniformity, move up or 
down or remain constant. Each of the thousands of com- 
modities traded in on the markets of any country, or of the 
world, moves in its own individual way, subject to a variety 
of influences. Yet it does not act in isolation. In its price 
movements it affects other commodities, and is affected 
by them. And, in addition to the forces peculiar to each 
commodity, there are broad forces which act throughout 
the price system, affecting all commodities. It is the busi- 

1 As compiled by the U. S. Bureau of Labor Statistics. 
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ness of the economic statistician to bring order out of the 
chaos of price movements taking place at any given time 
and, out of the multiplicity of minor movements, to pick 
the broad trends which affect the whole economic sys- 
tem. 

The forces bringing about the price movements that are 
to be studied are numerous and complicated, but some 
general conclusions may be drawn with regard to them. 
There are, in the first place, all those changes in production 
and consumption conditions peculiar to individual commodi- 
ties and affecting directly the prices of those commodities. 
The opening of new fields, improvements in production 
technique in individual cases, changes in fashion and the 
transfer of demand from some commodities to others, changes 
in demand and supply with the seasons — all these are 
causing constant price readjustments. These are the changes 
which in ordinary times are most obvious, which are brought 
home directly to the individual merchant or consumer. 
Such changes affect the whole price system, as has been 
pointed out, but not in general by causing upward or down- 
ward movements in the system as a whole. 

These general movements are due to forces that are 
broader in their scope. The general improvement in pro- 
duction technique and the increase in the productivity of 
human labor which has resulted have, by increasing the 
supply of commodities available for consumption, affected 
prices. Changes in monetary systems and, in particular, 
changes in the gold supply have exerted a direct and imme- 
diate influence upon prices, by affecting the supply of money 
in circulation. Similar in character have been changes in 
banking and credit systems and changes in commercial 
practice that have affected the use of credit instruments 
and the rapidity of circulation of money and credits. All 
these forces influence prices, though their incidence is not 
so specific as are those of the factors affecting individual com- 
modities directly. 
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Purpose of General Index Numbers of 
Wholesale Prices 

These separate forces cannot be isolated and evaluated. 
Their joint action causes a perplexing variety of price 
changes. In studying these changes the problem might be 
approached from several different points of view. It mighl 
be desired to study the readjustments that take place within 
the price system, to determine the nature and degree of the 
shifts within the system that come with changing conditions. 
Such a study would yield valuable information as to the 
behavior of prices and the character of their interrelations. 
Our immediate problem, however, is the determination of 
the net resultant of all these forces. Do all price movements 
cancel each other so that while some prices move up and 
some down there is no net change? Or is there at a given 
time a preponderance of movements in one direction, causing 
the level of general prices to move upward or downward? 
If there is such a trend, what is it, and how may it be meas- 
ured? Are the statistical methods that have been explained 
in the earlier sections applicable to the solution of this 
problem? 

The first step in this study involves the answering of the 
last question asked. It has been brought out that methods 
of summarizing quantitative data have been developed, 
but that these methods are applicable only when certain 
conditions are fulfilled. An average, it was noted, has no 
significance unless it represents a distinct central tendency 
in a mass of homogeneous data. Moreover, the type of 
average to be employed depends upon the character of the 
distribution it is to represent. Until the distribution of 
the original data is studied no average or other statistical 
measure can be intelligently employed. We must first, 
then, determine what the raw materials of the problem are, 
and study the frequency distributions secured when these 
raw materials are organized. 
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For the present a quite general purpose will be assumed, 
the d;termination of the change in the level of general 
wholesale prices between two specific dates. This is equiva- 
lent, of course, to measuring the change in the purchasing 
power of money in wholesale markets. The raw materials 
of the problem consist of a number of price quotations on 
individual commodities, quotations being secured for the 
two dates to be compared. Each pair of quotations meas- 
ures the change in the price of a single commodity, a change 
caused by the interplay of many forces. When a great 
many such price quotations are brought together we have 
a mass of data representing the interaction of a multitude 
of forces, some individual and specific in their incidence, 
some general, affecting the prices of large groups of com- 
modities or of all commodities. What we seek to determine 
is the net resultant of all these factors. We seek a measure 
of the composite effect of the numerous forces that are 
causing individual prices to rise or fall. This measure will 
constitute an index number of wholesale prices. 

The unit with which we must deal is a single price varia- 
tion. Whether the statistical methods with which we are 
familiar may be employed in the organization and analysis 
of a number of such units depends upon the behavior of 
such units in mass. The following examples illustrate the 
frequency distributions secured when these data are clas- 
sified. 


Frequency Distributions of Price Ratios 

Each price variation is, of course, a ratio, the ratio of the 
price of a commodity at a given date to the price of the 
commodity at another date. The ratios may be reduced 
to a comparable basis by putting them all in the form of 
relatives, of the type illustrated in the earlier examples of 
index numbers. Thus, using one of the pairs of price quo- 
tations given above, the ratio of the price of steel scrap 
in November, 1937, to the price in October, 1937, is 
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$12,500: $14,688, which, in the form of a relative, becomes 
85.1: 100. In constructing the following frequency table, the 
prices at wholesale in 1927 of 670 commodities were ex- 
pressed as relatives, with the 1926 price as a base in each 
case. The distribution of these 670 relative numbers is 
shown in Table 43. 


Table 43 


Distribution of the Relative Prices of 670 Commodities in 1927 1 



(Average prices in 1926 =» 100) 



Mid-point 

No. of cases 

Percentage of total 

Relative prices 

m 

s 

number of cases 

52.5 - 57.4 

55 

1 

.1 

57.5- 62.4 

60 

2 

.3 

62.5- 67.4 

65 

6 

.9 

67.5- 72.4 

70 

7 

1.0 

72.5- 77.4 

75 

8 

1.2 

77.5- 82.4 

80 

25 

3.7 

82.5- 87.4 

85 

50 

7.5 

87.5- 92.4 

90 

76 

11.3 

92.5- 97.4 

95 

136 

20.3 

97.5-102.4 

100 

196 

29.3 

102.5-107 4 

105 

83 

12 4 

107.5-112 4 

110 

26 

3.9 

112.5-117.4 

115 

16 

2.4 

117.5-122 4 

120 

14 

2.1 

122.5-127.4 

125 

12 

1.8 

127.5-132.4 

130 

2 

.3 

132.5-137.4 

135 

3 

.5 

137.5-142.4 

140 

5 

.8 

142.5-147.4 

145 

1 

.1 

147 5-152.4 

150 



152.5-157.4 

155 

1 

.1 



670 

100.0 


The frequency polygon representing this distribution ap- 
pears in Fig. 49. For purposes of comparison with similar 
distributions the figure shows the percentage distribution. 

1 The 670 commodities included were those employed by the U. S. Bureau 
of Labor Statistics in the construction of its index of wholesale prices. The 
original figures, and the relatives, appear in Bulletin 478 , of that Bureau, on 
11 Wholesale Prices, 1913-1927.” 
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The correspondence of this frequency distribution to the 
standard types portrayed in earlier sections is obvious. 
There is the same marked concentration about a central 
tendency, in this case a tendency of prices to remain stable, 
for 29 per cent of all the cases showed a change not exceed- 
ing 2 . 5 per cent from their prices in the base year. There 
is also, in this case, a fairly symmetrical distribution about 
this central tendency, though the range above the mode is 
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Relative Price 

Fig. 49. — Frequency Polygon: Distribution of Relative Prices of 
670 Commodities in 1927 (Average prices in 1926 = 100) 

slightly greater than the range below. Without at present 
considering the question as to which average might best be 
used to represent the central tendency in this distribution, 
it is apparent that the use of some average is quite legitimate. 

The example just given has been based upon price varia- 
tions from one year to the next, over a period during which 
the level of general prices declined slightly (4.6 per cent). 
W. C. Mitchell gives a much more comprehensive illustra- 
tion, based upon the distribution of 5,578 price variations 
from one year to the next over the period 1890-1913, which 
shows the same general grouping. The excess of the range 
above the mode over the range below is somewhat more 
pronounced, in connection with which fact it should be 
noted that prices were rising during most of the 23 years 
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covered. The distribution secured by Mitchell is shown in 
Fig. 42. 

The inertia of prices is most conspicuous when year-to- 
year price changes are studied. It is therefore advisable to 
consider the character of price variations over a longer 
period, that we may learn whether the same type of dis- 
tribution is secured. Two examples are given, one of price 
changes over a seven-year period, marked by a considerable 



Fig. 50. — Frequency Polygon : Distribution of Relative Prices of 
774 Commodities in 1933 (Average prices in 1926 = 100) 

decline in prices, the other of price changes over a five-year 
period characterized by rapidly rising prices. The table 
following shows the distribution of 774 price variations, 
prices in 1933 being expressed as relatives on a 1926 base. 
The general level of wholesale prices, it should be noted, 
declined some 33 per cent from 1926 to 1933. 

The data in Table 44 are plotted in the form of a frequency 
polygon in Fig. 50, the percentage distribution being shown. 
It will be noted that the distribution is curtailed, the five 
upper classes being omitted. 

The distributions depicted in Figs. 49 and 50 differ mar 
terially. The range of the variations is greater in the second 
case, a condition naturally to be expected because of the 
longer period covered. Secondly, a very much smaller per- 
centage of cases is concentrated in the modal group, though 
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Table 44 


Distribution of Relative Prices of 774 Commodities in 1933 

(Average prices in 1926=100) 


Relative prices 


Mid-point No. of cases Percentage of total 
m f number of cases 


10- 14.9 

12.5 

3 

.4 

15- 19.9 

17.5 



20- 24.9 

22.5 

1 

.1 

25- 29.9 

27.5 

7 

.9 

30- 34.9 

32.5 

13 

1.7 

35- 39.9 

37.5 

24 

3.1 

40- 44.9 

42.5 

28 

3.6 

45- 49.9 

47.5 

51 

6.6 

50- 54.9 

52.5 

49 

6.3 

55- 59.9 

57.5 

50 

6.5 

60- 64.9 

62.5 

62 

8.0 

65- 69.9 

67.5 

58 

7.5 

70- 74.9 

72.5 

93 

12.0 

75- 79.9 

77.5 

81 

10.5 

80- 84.9 

82.5 

62 

8.0 

85- 89.9 

87.5 

67 

8.7 

90- 94.9 

92.5 

40 

5.2 

95- 99.9 

97.5 

27 

3.5 

100-104.9 

102.5 

27 

3.5 

105-109.9 

107.5 

11 

1.4 

110-114.9 

112.5 

6 

.8 

115-119.9 

117.5 

8 

1.0 

120-124.9 

122.5 

1 

.1 

125-129.9 

127.5 

2 

.3 

155-159.9 

157.5 

1 

.1 

180-184.9 

182.5 

1 

.1 

190-194.9 

192.5 

1 

.1 



774 

100.0 


there is still a pronounced central tendency. Both distribu- 
tions, as plotted on the arithmetic scale, are fairly symmetri- 
cal, though a few extreme cases extend the actual upper limit 
of the second distribution. In Fig. 49 the concentration about 
the central tendency is much more marked, and the devia- 
tions of individual price ratios from the central tendency 
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are smaller. This distribution resembles one which would 
be secured from highly accurate physical measurements, or 
the distribution of shots from a very accurate piece of artil- 
lery. The second curve corresponds to one representing less 
accurate physical measurements, or to the distribution of 
shots from an old or inaccurate field piece. The modal 
value occurs less frequently and the deviations from the 
central tendency are greater. It has been established that 
the longer the period covered in price comparisons such as 
those made above, the more pronounced is the tendency 
shown in the second curve. The value of the maximum 
ordinate falls and the range of the distribution increases. 
The curve becomes flatter and more extended as the time 
interval increases. And, quite obviously, as this process 
goes on the representative character of any type of average 
declines. Unless there is concentration about a central 
tendency an average is merely an abstraction, without con- 
crete significance. 

It is possible at this point to state as a tentative conclu- 
sion that price variations are capable of statistical measure- 
ment, that they may be represented appropriately by an 
average value, provided the period covered is not too long. 
No definite statement can be made as to the maximum 
period over which price variations may be measured. Index 
numbers having accurate and significant values must be 
based upon comparisons over relatively short periods, the 
most accurate being year-to-year comparisons. Index num- 
bers designed merely to show general trends in prices may 
cover longer periods, though the makers and users of such 
index numbers should realize their limitations. 1 

As a final example we may note the distribution of the 
relative prices of 1,437 commodities in 1918, average prices 
during the period July, 1913 to June, 1914 serving as base. 2 

1 Cf. W. C. Mitchell, “The Making and Using of Index Numbers,” Bulle- 
tin 884 (Wholesale Price Series), U. S. Bureau of Labor Statistics. 

’Data compiled by the Price Section of the War Industries Board; repro- 
duced in Part I, Bulletin 284, U. S. Bureau of Labor Statistics, 70. 
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This was a period marked by rapidly rising prices. In con- 
sulting the graph (Fig. 51) it should be noted that the scales 
are not the same as those employed in the two figures pre- 
ceding. 

A study of this distribution bears out the conclusion 
reached from the two examples preceding. There is a central 
tendency sufficiently pronounced to be well represented by 
an average. In this case, moreover, the modal group is 



Relative Price 


Fig. 51. — Frequency Polygon: Distribution of Relative Prices of 1,437 
Commodities in 1918 (Average prices July 1913 to June 1914 = 100) 

that with a mid-point of 180, so that the tendency toward 
concentration cannot be attributed to inertia, but to the 
presence of external forces affecting the price system as a 
whole. There is, however, one marked point of difference 
between this distribution and the two others. The tendency 
toward skewness, which was in evidence in the first example, 
is pronounced in this case. The curve, as plotted on the 
arithmetic scale, is markedly asymmetrical. The greatest 
concentration is near the lower limit of the scale and a long 
tail, extending in fact far beyond the limit of the chart, 
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tapers out to the right. The highest relative price, indeed, 
is 3,009, representing an increase of 2,909 points. The small- 
est relative price, in comparison, is 36, representing a decline 
of 64 points on the scale. 

A price increase, expressed as a relative, has no upper 
limit. An increase of 100, 500, 1,000 per cent or more is 
conceivable and possible. The greatest price increase noted 
by the War Industries Board in its study of prices during 
the war was one of 4,981 per cent, in the case of acetipheneti- 
din. But 100 per cent is the maximum decline possible, as 
that would mean that the price of a commodity had fallen 
to zero. This is the explanation of the skewness noted in 
the curves shown. When any considerable number of price 
ratios are tabulated the corresponding frequency curve, 
plotted on an arithmetic scale, shows this characteristic 
feature, a feature which is most conspicuous during a period 
of rising prices. 

The argument developed in the preceding pages may be 
briefly summarized. Before discussing the practice of index 
number construction it was considered advisable to study 
the character of the raw materials and the nature of the 
distributions secured when these materials are brought to- 
gether, in order to determine whether ordinary statistical 
methods are appropriate. The raw materials, we have seen, 
consist of individual price variations, expressed as ratios. 
When a number of these ratios are assembled a frequency 
distribution is secured which somewhat resembles the dis- 
tribution of data following the normal law of error. A 
central tendency, which may legitimately be represented 
by an average, is apparent in the distribution of price varia- 
tions. The central tendency is less marked, however, and 
the deviations from it are more pronounced, the longer the 
period covered in the price comparison, so that an average 
becomes less representative as this period increases. In 
addition, a tendency toward skewness has been noted, and 
this was seen to be quite pronounced in a period of rising 
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prices. This skewness is due to the fact that we are dealing 
with ratios that have a definite lower limit and no upper 
limit. 

Variety of Methods Employed in Index Number 
Construction 

Many methods have been and are being employed in the 
construction of index numbers of wholesale prices. Usage 
varies for many reasons. There are differences of opinion 
as to which is theoretically the best method. There are 
practical difficulties to be surmounted, difficulties which 
inevitably cause differences in practice because of the vary- 
ing resources of the agencies engaged in these tasks. And 
there are, finally, differences due to the varying purposes 
for which index numbers are constructed, the varying ques- 
tions they are designed to answer. 

Prevailing differences in practice and differences in the 
results secured by the employment of various methods in 
the construction of index numbers can perhaps be illus- 
trated most effectively by the application of a number of 
methods to the same data. Table 45, on the preceding page, 
presents the raw material to which these various methods are 
to be applied — the average farm prices, on December 1, of 
twelve leading crops, from 1919 to 1935. 

explanation of symbols 

The symbols to be employed in the computation of dif- 
ferent types of index numbers have the following meanings: 

po' : price of a given commodity at time “0” (the base period). 

qo' : quantity of same commodity at time “0”. 

pi : price of same commodity at time “ 1 

qi : quantity of same commodity at time “ 1 ”. 

po " : price of a second commodity at time "0”. 

q" : quantity of second commodity at time “0”. 

p" : price of second commodity at time “ 1 ”. 

q" : quantity of second commodity at time “ 1 ”. 

Vi 

— , : a price relative (relation of price of a given commodity at 
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time “ 1 ” to price of same commodity at time “0”). 

: a quantity relative. 

Qo 

Po : price level at time “0”. 

Pi : price level at time “ 1 ”. 

Simple Index Numbers op Prices 
In his exhaustive analysis of methods of index number 
construction 1 Irving Fisher distinguishes six fundamental 
types: the aggregative (or price aggregate), the arithmetic, 
harmonic, geometric, median, and mode. The latter has 
never been employed in a practical way, and may be omitted. 
The characteristics of the five remaining types may be 
brought out by considering each of them in its simplest form, 
before examining the more complicated combinations. 

AGGREGATES OF ACTUAL PRICES 
In the construction of index numbers of the simple ag- 
gregative type, commodity prices pertaining to a given 
date are added; general price changes are measured by 
comparing the results thus secured for different dates. Using 
the above symbols 

Pi _ 2pi. 

Po 2po 

When such index numbers are constructed from the data of 
Table 45 the results in Table 46 on page 182 are secured. 
The actual aggregates are given in column (2) ; to facilitate 
comparison the same figures are reduced to relatives, with 
the 1919 aggregate as base, in column (3). 

The results secured by this method of constructing index 
numbers of prices will be compared shortly with results 
secured from the same data by other methods. The chief 
weakness of this type of index number is obvious. This is 
not an unweighted nor yet an equally weighted index. 
The influence of each commodity upon the result is depend- 
ent upon the price of the unit in which it happens to be 
1 The Making of Index Numbers, Houghton Mifflin Co., 1922. 
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Table 46 


Index Numbers of Farm Crop Prices 
(Aggregates of actual prices) 


(1) 

(2) 

(3) 

Year 

Index 

(aggregate of 
actual prices ) 

Index, relative 
(1919 » 100) 

1919 

$36,349 

100 

1920 

26.790 

74 

1921 

18.690 

51 

1922 

19.913 

55 

1923 

21.838 

60 

1924 

23.142 

64 

1925 

23.831 

66 

1926 

22.499 

62 

1927 

19.291 

53 

1928 

19.584 

54 

1929 

21.339 

59 

1930 

18.290 

50 

1931 

13.211 

36 

1932 

9.503 

26 

1933 

13.691 

38 

1934 

20.723 

57 

1935 

12.844 

35 


traded. In the present index, hay, which is quoted by the 
ton, is given more weight than all the other 11 commodities 
combined, with flaxseed second in importance. The index 
secured by adding the quotations is weighted in an entirely 
illogical fashion and cannot be accepted as reflecting the 
course of farm crop prices. 

One method which has been employed for avoiding the 
unequal weighting caused by the difference in units in 
which different commodities are traded is to reduce all 
quotations to the same unit. Thus hay, rice, corn, cotton, 
and the other commodities might all be quoted by the 
pound, and these quotations added to secure the index. 
Yet this method, which has been employed in the con- 
struction of Bradstreet’s index, merely replaces one system 
of illogical weighting by an equally illogical one. Equal 
weight, if such is desired, is not given to all commodities 
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by this method. Thus, in 1919 hay was worth $.010075 
per pound, cotton $.356 per pound and rice $.059 per 
pound, cotton having a weight in an aggregate of per pound 
prices 6 times that of rice and 35 times that of hay. 

ARITHMETIC AVERAGES OF RELATIVE PRICES 

Another method employed in the construction of index 
numbers involves the reduction of each quoted price to a 
relative, with reference to the price of the same commodity 
at a certain basic date, these relative figures then being 
averaged by any of the conventional methods. The example 
in Table 47 illustrates the first phase of this process, data 
for two years being utilized. The year 1919 is taken as base. 

Table 47 

Computation of Relative Prices for the Construction of Index Numbers 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

Commodity 

Unit 

Price , 1919 

Relative 

Price , 1920 

Relative 

Corn 

Bu. 

$ 1.343 

100 

$ .656 

48.8 

Cotton 

Lb. 

.356 

100 

.139 

39.0 

Hay 

Ton (sh.) 

20.150 

100 

17.780 

88.2 

Wheat 

Bu. 

2.131 

100 

1.433 

67.2 

Oats 

Bu. 

.702 

100 

.456 

65.0 

Wh. Potatoes 

Bu. 

1.580 

100 

1.128 

71.4 

Sugar 

Lb. 

.102 

100 

.053 

52.0 

Barley 

Bu. 

1.215 

100 

.716 

58.9 

Tobacco 

Lb. 

.390 

100 

.212 

54.4 

Flaxseed 

Bu. 

4.383 

100 

1.770 

40.4 

Rye 

Bu. 

1.331 

100 

1.256 

94.4 

Rice 

Bu. 

2.666 

100 

1,200 

1.191 

44.7 

724.4 


From these figures the arithmetic averages of relative 
prices in these two years may be readily computed. The 

7) 9 

formula for any single relative is When there are N 

V o 

relatives the formula for the index number at time “ 1 ” is 
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In the present case 

Index (1919) = = 100. 

724 4 

Index (1920) = = 60.4. 

Index numbers computed in this way for the years 1919 to 
1935, inclusive, are shown in column (3) of Table 50. 

This type of index number is usually termed an “un- 
weighted” index of relative prices. It is weighted, however, 
just as are the types illustrated in the two examples pre- 
ceding. The quantity employed as weight in each case is 
the amount of each commodity which would sell for $100 
in the base year. In the preceding example the following 
quantities have been employed as weights: 


Corn 

74.5 bu. 

Cotton 

280.9 lbs. 

Hay 

4 . 96 tons 

Wheat 

46.9 bu. 

Oats 

142.5 bu. 

Potatoes 

63.3 bu. 

Sugar 

980.4 lbs. 

Barley 

82.3 bu. 

Tobacco 

256.4 lbs. 

Flaxseed 

22.8 bu. 

Rye 

75.1 bu. 

Rice 

37.5 bu. 


What has been done, in effect, in the computation of the 
simple average of relative prices has been to determine the 
aggregate amount for which the above quantities would sell 
in each of the eleven years included. At 1919 prices each 
of the above quantities would sell for $100, the aggregate 
value being $1,200; at 1920 prices the aggregate value of 
the above quantities was $724.40. These aggregates, di- 
vided by 12, give the index numbers shown in column (3), 
Table 50: 100 for 1919, 60 (60.4) for 1920, etc. Thus the 
“unweighted average of relative prices” is in fact a weighted 
aggregate of actual prices. It is equally weighted in the 
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sense that the value of the quantity of each commodity 
employed as weight was equal to $100 in the base year, 1919. 1 

MEDIANS OF RELATIVE PRICES 

The median rather than the arithmetic mean may be 
employed in securing the average of the relative prices for 
each year. When the relatives in column (6) of Table 47 
are arranged in order of magnitude the following distribution 
is secured: 


39.0 

58.9 

40.4 

65.0 

44.7 

67.2 

48.8 

71.4 

52.0 

88.2 

54.4 

94.4 


The smallest relative price is 39.0, the greatest 94.4; 
the median value is 56.65. This median value is the index 
number for 1920. All the index numbers computed in this 
way from the medians of relative prices are presented in 
column (4), Table 50. 


GEOMETRIC AVERAGES OF RELATIVE PRICES 


The geometric averages of the relative prices for the 
various years may now be computed and the results com- 
pared with those secured in the preceding examples. A 

V ' 

single relative being represented by the symbol the 

P° 

formula for the geometric mean of N relatives is 


M„ 



v — 

A _ m 

V o 


X . . . . 


A geometric mean is generally computed by the aid of 
logarithms; in this case 


Log Mg = 


l0e (g) +l08 (^) +l0g (^) 


N 



1 Attention was called to this characteristic of the simple average of relative 
prices by F. R. Macaulay, American Economic Review , Dec., 1915, 928. 
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The method of computation may be illustrated for the 
years 1919 and 1920. The relative prices of the various 
commodities are repeated from Table 47. 

Table 48 

Computation of Geometric Averages of Relative Prices 


(1) 

(2) 

(3) 

(4) 

(5) 

Commodity 

Relative price, 

Logarithm of 

Relative price, 

Logarithm of 

1919 

fig. in col. (2) 

1920 

fig. in col. (4) 

Com 

100 

2.0 

48.8 

1.68842 

Cotton 

100 

2.0 

39.0 

1.59106 

Hay 

100 

2.0 

88.2 

1.94547 

Wheat 

100 

2.0 

67.2 

1.82737 

Oats 

100 

2.0 

65.0 

1.81291 

Wh. Potatoes 

100 

2.0 

71.4 

1.85370 

Sugar 

100 

2.0 

52.0 

1.71600 

Barley 

100 

2.0 

58.9 

1.77012 

Tobacco 

100 

2.0 

54.4 

1.73560 

Flaxseed 

100 

2.0 

40.4 

1.60638 

Rye 

100 

2.0 

94.4 

1.97497 

Rice 

100 

2.0 

44.7 

1.65031 



24.0 


21.17231 


Logilf, (1919) =~ = 2 

M a = anti-logarithm of 2 = 100 
Log Mg (1920) = = 1.76436 

Mg = anti-logarithm of 1.76436 = 68.1. 

This value, 58.1, is the index number for 1920. The 
results for all the years are summarized in column (5), 
Table 50. 

HARMONIC AVERAGES OF RELATIVE PRICES 

The characteristics of the harmonic average have been 
discussed in a preceding chapter. The reciprocal of the 
harmonic mean, it will be recalled, is the arithmetic mean 
of the reciprocals of the constituent measures. The con- 
stituent items, in the present case, are price relatives of the 
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t t 

form The reciprocal of such a relative is ;~r The 

Po P i 

formula for the harmonic mean of N price relatives is, 
therefore, 



The method of computation is illustrated in Table 49. 


Table 49 


Computation of Harmonic Averages of Relative Prices 


(1) 

(2) 

(3) 

(4) 

(5) 

Commodity 

Relative price. 

Reciprocal of 

Relative price, 

Reciprocal of 

1919 

fig. in col. (2) 

1920 

fig. in col. (4) 

Corn 

100 

.01 

48.8 

.02049180 

Cotton 

100 

.01 

39.0 

.02564103 

Hay 

100 

.01 

88.2 

.01133787 

Wheat 

100 

.01 

67.2 

.01488095 

Oats 

100 

.01 

65.0 

.01538462 

Wh. Potatoes 

100 

.01 

71.4 

.01400560 

Sugar 

100 

.01 

52.0 

.01923077 

Barley 

100 

.01 

58.9 

.01697793 

Tobacco 

100 

.01 

54.4 

.01838235 

Flaxseed 

100 

.01 

40.4 

.02475248 

Rye 

100 

.01 

94.4 

.01059322 

Rice 

100 

.01 

44.7 

.02237136 



.12 


.21404998 


H (1919) = -^ = 100 

The index numbers computed in this way for all the years 
included in the study are shown in column (6), Table 50. 
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In the construction of the five types of index numbers 
explained above no attempt has been made to use a logical 
weighting system. All are termed “unweighted” averages, 
a term which is quite misleading. The first index con- 
structed, based on aggregates of actual prices, is a heavily 
weighted index number, though the weights are illogical. 
In the next four the quantities employed as weights are the 
amounts purchasable for $100 in 1919. The five results 
are brought together and compared in Table 50. In each case 
the index is given to the nearest whole number. These index 
numbers are plotted in Fig. 52. 

Comparison of Simple Index Numbers 

The four averages of relative prices agree much more 
closely with each other than with the index numbers based 
on aggregates. For reasons already suggested the latter is 
quite untrustworthy as a measure of price changes. Of the 
other index numbers, the arithmetic, geometric, and har- 
monic means show a consistent relationship, a fact which 
follows from the nature of the averages employed. Except 
in the base year the geometric mean is always less than 
the arithmetic and the harmonic is always less than the 
geometric, the amount of difference increasing as the dis- 
persion of prices becomes greater. The median, with only 
twelve items to be averaged, is somewhat unstable, and its 
relationship to the other averages is not always a consistent 
one. 

How are we to choose among these varying results? No 
one of these “unweighted” index numbers is perfect, for 
weights which have crept in do not measure the relative 
importance of the various commodities included in the 
index numbers. But, neglecting for the moment the question 
of weights, is it possible to test the adequacy of the different 
methods of measuring changes in the prices as given? 
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Table 50 


Index Numbers of Farm Crop Prices, 1919-1935 
(1919 = 100) 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 


Aggregates 

Arithmetic 

Medians 

Geometric 

Harmonic 

Year 

of actual 

averages of 

of 

averages of 

averages of 

prices {as 

relative 

relative 

relative 

relative 


relatives) 

prices 

prices 

prices 

prices 

1919 

100 

100 

100 

100 

100 

1920 

74 

60 

57 

58 

56 

1921 

51 

44 

42 

43 

42 

1922 

55 

51 

50 

50 

49 

1923 

60 

55 

50 

54 

53 

1924 

64 

60 

61 

59 

58 

1925 

66 

59 

53 

57 

55 

1926 

62 

53 

49 

52 

50 

1927 

53 

53 

55 

52 

52 

1928 

54 

48 

48 

47 

46 

1929 

59 

54 

53 

53 

52 

1930 

50 

38 

32 

36 

35 

1931 

36 

27 

27 

27 

26 

1932 

26 

20 

18 

19 

19 

1933 

38 

35 

33 

34 

34 

1934 

57 

48 

48 

46 

43 

1935 

35 

35 

36 

35 

34 



Fig. 52. — Comparison of Five Simple Index Numbers of Farm Crop 
Prices. 1919-1935 (1919 = 100) 
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THE TIME REVERSAL TEST 

For this purpose Irving Fisher has employed what he 
terms the “time reversal test.” This is merely a test to 
determine whether a given method will work both ways in 
time, forward and backward. If from 1935 to 1936 sugar 
should increase from four to eight cents a pound, the price 
in 1936 would be 200 per cent of the price in 1935, and the 
price in 1935 would be 50 per cent of the price in 1936. 
One figure is the reciprocal of the other; their product 
(2.00 X .50) is unity. Similarly, if a given method of index 
number construction shows the general price level in one 
year to be 200 per cent of the level in the preceding year, it 
should work correctly when reversed; it should show that 
the price level in the first year was 50 per cent of the price 
level in the second year. When the data for any two years 
are treated by the same method, but with the bases reversed, 
the two index numbers secured should be reciprocals of each 
other. Their product should always be unity. If it is not, 
there is an inherent bias in the method. 

This test may be applied to the methods employed above, 
using prices for 1919 and 1920. With 1919 as base the 
following results were obtained : 


Year 

Aggregates 
of actual 
prices (as 
relatives) 

Arithmetic 
averages of 
relative 
prices 

Medians of 
relative 
prices 

Geometric 
averages of 
relative 
prices 

Harmonic 
averages of 
relative 
prices 

1919 

1920 



100 

56.65 

100 

58.1221 

100 

56.0617 


and with 1920 as base : 


Year 

Aggregates 
of actual 
prices (as 
relatives) 

Arithmetic 
averages of 
relative 
prices 

Medians of 
relative 
prices 

Geometric 
averages of 
relative 
prices 

Harmonic 
averages of 
relative 
prices 

1919 

1920 

135.68122 

100 

178.36666 

100 

176.85 

100 

172.04 

100 

165.6467 

100 















WEIGHTING OF INDEX NUMBERS 191 


When the index numbers for 1920 in the first table are 
multiplied by the corresponding index numbers for 1919 
in the second table, we have the following values. (In 
seeming these products the index numbers are put in the 
ratio, not in the percentage form.) 


Aggregates 
of actual 
prices 

Arithmetic 
averages of 
relative 
prices 

Medians of 
relative 
prices 

Geometric 
averages of 
relative 
prices 

Harmonic 
averages of 
relative 
prices 

1.00 

1.0767 

1.00 


.9286 


This time reversal test is met by three of the methods 
employed. It is not met by either the arithmetic or har- 
monic averages. The former has a distinct upward bias, 
amounting to more than seven per cent when the errors for 
1919 and 1920 are compounded, while the harmonic mean 
shows almost as large an error in the opposite direction. 
Unless the inherent bias which is found in both these aver- 
ages is rectified in some way, methods based upon these 
averages should not be used in the construction of index 
numbers. 


The Weighting of Index Numbers 

Five simple index numbers of prices have been described 
in the preceding section. With the introduction of weighting 
the number of possible combinations is greatly increased, 
but only a few of these types need concern us here. 

In the construction of an accurate measure of price changes 
logical weights must be employed, weights which truly reflect 
the relative importance of the commodities included. If the 
weighting problem is ignored haphazard and illogical weights 
will inevitably be present, whether recognized or not. 

The data used in the preceding examples may be utilized 
to illustrate methods of weighting and to show the effects 
of varying weights upon the values of index numbers. The 








Annual Physical Production , Twelve Crops , 1919-1935 1 
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weights employed in constructing index numbers of farm 
crop prices may be either the quantities or values of the 
crops produced, depending upon the type of index selected. 
The quantities produced during the period 1919-1935 are 
given in Table 51. 

WEIGHTED AGGREGATES OF ACTUAL PRICES 

The thoroughly illogical results obtained when actual 
prices, as quoted, are totaled to secure an index number 
have been pointed out. The same objection cannot be 
made when the prices are appropriately weighted before 
the aggregate is taken. If for weights we employ the quan- 
tities produced in the base year (at time “0”) the formula 
for the weighted aggregate is 

0 qo 

This is, in effect, the method employed by the United States 
Bureau of Labor Statistics, though the quantities are taken 
from a year other than the base year. The formula for this 
type of weighted aggregative index is known as Laspeyres’ 
formula. The method is illustrated in Table 52. 

The desired index numbers, in the form of relatives, may 
be computed from the aggregates secured by totaling col- 
umns (5) and (8) of Table 52. Either year may be taken 
as the base, and the price aggregate in the other year ex- 
pressed as a relative on this base. With the 1919 aggregate 
as base the index for 1920 is 58 . 2. Index numbers similarly 
computed for the other years are given in column (2), 
Table 55. 

Another type of weighted aggregate may be constructed, 
with weights taken not from the base period but from the 
later period in the given comparison. That is, we may 
employ qi (quantity at time “1”) as weight in comparing 
prices at time “1” with prices at time “0,” and employ q 2 
(quantity at time “2”) as weight in comparing prices at 



Table 52 

Computation of Weighted Aggregates of Actual Prices 


(8) 

Price X weight 

Pi9» 

11,757,424,000 

792.995.000 
1,361,770,200 
1,364,359,300 

504.792.000 
335,354,400 

231.663.000 
93,867,600 

306.128.000 
11,982,900 
98,847,200 
50,843,790 

$6,910,027,390 

(7) 

Weight (quarir 
tity produced 
1919, in 
millions) 

?• 

AQ »-H CO *“H 1> CO 

Oiiocdcsir^.t^»— i^-H^cboocsi 

1>- O Cl CO^ »-H 

C<T to* *-H i-H 

1 

(6) 

Price, 

1920 

Pi 

CD03QMCD00W(D(NOCDH 

iOCOOOWiOC4iOHHNU5a> 

?OHN^^rHONCH>N»H 

I s *. t-H t— 1 t-H t-H 

t-H 


(5) 

Price X weight 

V o9» 

$ 3,597,897,000 
2,030,980,000 
1,543,288,500 
2,028,925,100 

777.114.000 

469.734.000 

445.842.000 
159,286,500 

563.160.000 
29,672,910 
104,749,700 
113,811,540 

S 

<N 

s 

£ 
i— T 

T-H 

(4) 

Weight ( quan- 
tity produced 
1919, in 

millions) 

Qo 

cr> r— . oa 

to T—l CO t-H I s - CO 

OiiOcb(N!^NtHr-H^cboO<M 
NONiOOOJNCO^ 1> TP 

cq^ o* ^ cq r— < 

CQ lO iH r>H 


(3) 

Price, 

1919 

P» 

fO^QHWQN^OMHtO 

TfiC»OfOO®0'H0506cOCO 

co co i-H t-h • c>» to »—t co co co co 

1— « *0031 r-H T—i r-4 d 

^ «• 


§ 1 

3 

3 J3 § 3 3 S J3 3 J3 3 S 3 


ttfipoumoQ 

(I) 

Com 

Cotton 

Hay 

Wheat 

Oats 

Potatoes, Wh. 
Sugar 

Barley 

Tobacco 

Flaxseed 

Rye 

Rice 
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time “2” with prices at time “0.” Algebraically, the 
formula for the index number at time “ 1 ” is 

Spigi 

Spogi' 

This is known as Paasche’s formula. The process of compu- 
tation is precisely the same as in the preceding example, 
except that the weights are changed with each successive 
year. The index numbers secured by this method are given 
in column (3), Table 55. 

The weights in these two cases have been quantities, for 
prices, multiplied by quantities, give aggregates in dollar 
values. But in weighting individual price relatives, quanti- 
ties will not serve. The abstract relatives must be weighted 
by values, if the resulting products are to be comparable. 
For values are in terms of a common dollar unit, while 
quantities may be expressed in a variety of units. The values 
which are to be employed as weights may be derived in 
various ways. 

Fisher 1 outlines the four following methods, of which the 
second and third are hybrid types: 

I. Each weight = base year price X base year quantity (pogo). 

II. Each weight = base year price X given year quantity (pogi). 

III. Each weight = given year price X base year quantity (pigo). 

IV. Each weight = given year price X given year quantity (pigi). 

Just as certain averages possess inherent bias, so a distinc- 
tive weight bias arises from each type of value weighting. 
(This inherent bias is absent from the quantity weighting.) 
A downward bias arises from weighting systems I and II 
(in which base year prices are used), while an upward bias 
arises from weighting systems III and IV (using prices in 
the given year). This is in part capable of mathematical 
demonstration 2 and has in part been established by numer- 
ous trials. 

1 Irving Fisher, The Making of Index Numbers, 54. 

* An index weighted by type III must exceed an index weighted by type I. 

(. Footnote 2 continued on page 196.) 
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In the several examples next following we shall deal only 
with values of quantities produced in the base year, 1919. 
These values are given in the third column of Table 53. For 
weighting purposes they are taken to the nearest million. 

WEIGHTED ARITHMETIC AVERAGES OP 
RELATIVE PRICES 

In the computation of an index of this type, each relative 
is multiplied by the appropriate weight and the sum of the 
products is divided by the sum of the weights. The process 
is illustrated in Table 53. 

The index for 1920, it will be noted, is identical with that 
secured from the computations illustrated in Table 52. That 
index is a weighted aggregate of actual prices, the weights 
being the quantities produced in the base year. An arith- 
metic mean of relative prices, weighted by values in the base 
year, is always equal to a relative constructed from such an 
aggregate. 1 

{Footnote 2 continued from page 195.) 

Weighting the price relative of a given commodity by type III, we have 

? X Ptfo 
Po 

while by type I we have 

* — X poqo. 

Po 

If pi exceeds p 0 (if the price relative is above 100) the weight by type III 
(pi^o) is greater than the weight by type I (poqo). That is, all relatives above 
100 are more heavily weighted by type III than by type I. But if pi is less than 
Po the weight by type III (pif/o) is less than the weight by type I {poqo). All 
relatives below 100 are less heavily weighted by type III than by type I. Thus 
the effect of all price increases is over-emphasized and the effect of all price 
declines is under-emphasized by type III, giving a net result always greater 
than type I. The same is true of type IV as compared with type II. As be- 
tween types I and IV there is no necessary relation, but in general an index 
weighted by type IV will exceed an index weighted by type I. Base year 
weighting involves a downward bias while given year weighting involves an 
upward bias. (For a more detailed discussion of bias in weighting see Fisher, 
The Making of Index Numbers , Chapter V and pages 384-387.) 

1 This may be readily demonstrated algebraically. The value of any com- 

P\ 

modity in the base year is potfo, while the price relative for a second year is — . 

Po 

{Footnote 1 continued on page 197.) 
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Table 53 


Computation of Weighted Arithmetic Averages of Relative Prices 


Com- 

modity 

Relative 

price f 

1919 

Weight 

Relative 

price 

X weight 

Relative 

pricey 

1920 

Weight 

Relative 

price 

X weight 

Corn 

100 

r, 3,598 
2,031 

$359,800 

48.8 

$3,598 

$175,582 4 

Cotton 

100 

203,100 

39.0 

2,031 

79,209.0 

Hay 

100 

1,543 

154,300 

88.2 

1,543 

2,029 

136,092.6 

Wheat 

100 

2,029 

202,900 

67.2 r 

136,348.8 

Oats 

100 

777 

77,700 

65.0 

111 

50,505.0 

Potatoes 

100 

470 

47,000 

71.4 

470 

33,558 0 

Sugar 

100 

446 

44,600 

52.0 

446 

23,192.0 

Barley 

100 

159 

15,900 

58.9 

159 

9,365.1 

Tobacco 

100 

563 

56,300 

54.4 

563 

30,627.2 

Flaxseed 

100 

30 

3,000 

40.4 

30 

1,212 0 

Rye 

100 

105 

10,500 

94.4 

105 

9,912.0 

Rice 

100 

114 

$11,865 

11,400 

$1,186,500 

44.7 

114 

$11,865 

1 5,095.8 

$690,699.9 


(The weights employed are the values of the quantities produced in 
1919, in millions.) 

Weighted arithmetic mean (1919) = = 100 

$ 11,000 

Weighted arithmetic mean (1920) = = 58.2 


( Footnote 1 continued from page 196.) 

The weighted mean of such price relatives is equal to 

x poV + ~ X Po' V' + X Po" V' + . . . 

Po Po Po 

po <lo + po T* po Qo *T • • • 

which reduces to 

Spigo 
Spotfo ’ 

a weighted aggregate of the type mentioned. 

In the same way the harmonic mean, weighted by full values in the second 
year, reduces to 

Spigi # 

Spo?i 

This has already been encountered as an aggregate of actual prices weighted 
by quantities in the second year. 
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WEIGHTED GEOMETRIC AVERAGES OF 
RELATIVE PRICES 

The process of computing the weighted geometric mean 
is identical with that of computing the unweighted geometric 
mean, except that the logarithm of each relative is multi- 
plied by the given weight and the sum of these weighted 
logarithms is divided by the sum of the weights, the result 
being the logarithm of the desired index. 1 The method is 
illustrated in Table 54. 


Table 54 


Computation of Weighted Geometric Average of Relative Prices, 1920 

(1919 = 100) 


Commodity 

Relative 
price , 1920 

Logarithm of 
relative price 

Weight 

Logarithm of 
relative price 
X weight 

Cora 

48.8 

1.68842 

3,598 

6074.93516 

Cotton 

39.0 

1.59106 

2,031 

3231.44286 

Hay 

88.2 

1.94547 

1,543 

3001.86021 

Wheat 

67.2 

1.82737 

2,029 

3707.73373 

Oats 

65.0 

1.81291 

777 

1408.63107 

Potatoes, Wh. 

71.4 

1.85370 

470 

871.23900 

Sugar 

52.0 

1.71600 

446 

765. 33600 

Barley 

58.9 

1.77012 

159 

281.44908 

Tobacco 

54.4 

1.73560 

563 

977. 14280 

Flaxseed 

40.4 

1.60638 

30 

48. 19140 

Rye 

94.4 

1.97497 

105 

207.37185 

Rice 

44.7 

1.65031 

114 

188. 13534 




11,865 

20,763.46850 


Log Mg = 

20,763.46850 

11,865 

= 1.74998, 



M t = 56.2 


The index for 1920 on the 1919 base is 56.2. Measure- 
ments secured for all the years of the period covered are 
given in column (5), Table 55, together with the other 
weighted index numbers already explained. 

1 The formula for the weighted geometric mean is given in Chapter IV. 
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How are we to judge of the relative merits of these three 
index numbers? We may, first, apply the time reversal 
test which was employed in comparing the five simple index 
numbers. This test is not met by any of the weighted types 
we have constructed. The geometric is equally at fault 
with the others. Though the simple geometric meets the 
test, the introduction of weighting imparts a bias to the 
result. Judged by that test alone none of the three is sat- 
isfactory. We may next try the second fundamental test 
that Fisher has developed, which is termed the "factor 
reversal test.” 


THE FACTOR REVERSAL TEST 


The total value of a given commodity in a given year is, 

of course, the product of the quantity produced and the 

price per unit; algebraically, it is equal to p'q'. The ratio 

of the total value in one year to the total value in the preced- 
/ / 

ing year is , - 7 - If, from one year to the next, both price 
Po qo 

and quantity should double, the price relative would be 200, 
the quantity relative 200, and the value relative 400. The 
total value in the second year would be four times the value 
in the first year. The value relative would be equal to the 
product of the price and quantity relatives, a relationship 
which is obvious in the case of a single commodity. 

If, for a number of commodities, we construct an index 
of the price change from one year to the next and an index 
of the quantity change from one year to the next, we should 
expect their product to be equal to the ratio of the total 
values in the second year to the total values in the first 
year. If the product is not equal to the value ratio, there 
is an error in one or both of the index numbers. 

As an illustration, we may apply this test to the first 


aggregative index constructed 



An index of quan- 


tities may be computed from this same formula, merely 
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interchanging the q’s and the p’s; the formula becomes 

SgiPo 

Sgopo 

The same price factor appears in numerator and denom- 
inator, as we desire to measure only the effect of the quan- 
tity change. Substituting the given values of the twelve 
farm crops we have 

Quantity index, 1920 (1919 = 100) = = 1-0956. 

In percentage form the index of quantities produced in 
1920 is 109.56, with 1919 as base. The corresponding price 
index, by the same formula, is 58 . 24. The product 
1.0956 X .5824 = .6381. 

That is, if prices have decreased 41.76 per cent, while 
quantities have increased 9.56 per cent, the total value 
should show a decrease of 36 . 19 per cent. 

For the value ratio we have 


^Pigi $7,441,317,450 __ 0272 

Spogo “ $11,864,461,250 ~ ' 

There is a discrepancy here of about one per cent. The 
actual error is not great, but the formula definitely fails to 
meet the factor reversal test, and cannot be accepted as 
satisfactory. 

When this test is applied to the second aggregative index 
we secure the following values for 1920, with respect to 
1919 as base: 

Price index = — 57.25 

2pogi 

Quantity index = = 107 . 69 

2g 0P1 

Product = .5725 X 1.0769 = .6165 
(In securing the product the index numbers are put in 
the ratio, not in the percentage form.) 

Here is an error of the same magnitude in the other direction. 
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The weighted geometric average also fails to meet this 
fundamental factor reversal test. With respect to both the 
geometric index and the aggregates we have, apparently, 
by the introduction of weights spoiled index numbers which 
in their simple form were unbiased. Yet weights we must 
have, if the index numbers are to represent the facts ac- 
curately. Neither a simple index nor a weighted form of a 
simple index will meet the two tests laid down as funda- 
mental. Professor Fisher tested 46 such formulas, of which 
only four (the simple geometric, median, mode, and ag- 
gregative) met the time reversal test, and none met the 
factor reversal test. 


THE “IDEAL” INDEX 

A way out of this difficulty is offered by the possibility 
of “rectifying” formulas in a crossing process, by averaging 
geometrically formulas which err in opposite directions. 
Professor Fisher has made exhaustive trials of all possible 
formulas by this process, finding thirteen formulas in all 
which met both tests. Of these he has selected one as 
“ideal,” from the viewpoint of both accuracy and simplicity 
of calculation. This ideal index is the geometric mean of the 
two aggregative types illustrated above. Its formula 1 is 

i / Spigo x Spigi 
r Spo^o 2po<}i 

This index may be computed readily, in the present 
instance, from the results already obtained. Thus for 1920 
we have 

Ideal index = V-5824 X .5725 
= .5774. 

In the customary percentage form this is 57.74. 

This index number meets both the time reversal and the 
factor reversal test. Applying the former : 

1 The same formula was developed independently by Bowley, Pigou, Walsh, 
and Young. See The Making of Index Numbers , xv, 240-242. 
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Index of prices, 1920 (1919 = 100) - 57.74 
Index of prices, 1919 (1920 = 100) = 173.18 
.5774 X 1.7318 - 1.00. 


For the factor reversal test, applied to the data for 1920 
(with 1919 as base), we have 


Index of prices 
Index of quantities 



Spigi 

2p 0 ?i 

2gipi 

ZqoPi 


57.74. 

108.62. 


Value ratio = 


2pigi 

2p oQo 


.6272. 


Product of price and quantity indices = . 5774 X 1 . 0862 = . 6272. 

The ideal index, the two weighted aggregates that enter 
into its construction and the geometric mean weighted by 



Fla 53. — Comparison of Four Weighted Index Numbers of Farm Crop 
Prices, 1919-1935 (1919 = 100) 


values in the base year are given in Table 55 for the years 
1919 to 1935. The index numbers are plotted in Fig. 53. 

The wide discrepancies that were found between the vari- 
ous simple index numbers do not appear when the weighted 
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Table 55 

Comparison of Weighted Index Numbers of Farm Crop Prices, 

1919-1935 


(1) 

(2) 

(3) 

(4) 

(5) 

Year 

Aggregative 
{weighted by 
base year 
quantities) 

Aggregative 
{weighted by 
given year 
quantities) 

Ideal index 
Geometric 
mean of in- 
dices in cols . 
(2) and (3) 

Weighted 
geometric average 
{weighted by 
base year 
quantities ) 


2)pi<7o 

Smi 








1919 

100.0 

100.0 

100.0 

100.0 

1920 

58.2 

57.2 

57.7 

56.2 

1921 

42.8 

42.0 

42.4 

41.5 

1922 

53.6 

53.1 

53.4 

52.9 

1923 

59.8 

59.7 

59.8 

58.1 

1924 

65.0 

64.3 

64.6 

64.4 

1925 

57.9 

56.3 

57.1 

56.5 

1926 

51.4 

49.2 

50.3 

49.6 

1927 

54.5 

54.3 

54.4 

54.3 

1928 

51.8 

51.1 

51.4 

51.2 

1929 

54.1 

53 3 

53.7 

53.4 

1930 

41.3 

39.6 

40.4 

39.4 

1931 

26.6 

25.5 

26.0 

25.3 

1932 

19.0 

18.9 

19.0 

18.1 

1933 

32.6 

32.2 

32.4 

32.2 

1934 

51.6 

52.1 

51.8 

49.5 

1935 

38.1 

37.6 

37.9 

37.8 


indices are compared. There are significant differences, but 
there is none of the erratic behavior of some of the simpler 
forms. 

Of these four types the ideal index probably serves 
as the best measure of the average price change between 
1919 and each of the given years. 1 It is designed, it should 
be remembered, to measure the change between two stated 
times, and not for intermediate comparison. The value of 
the index for 1933, for instance, is determined by the rela- 
tion between prices and quantities in 1919 and in 1933. 

1 The year 1919, which is here employed as base, is not a satisfactory stand- 
ard of reference for economic purposes. It was a disturbed year, marking a 
transition from war-time to peace-time conditions. 
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There is double weighting and the weights vary from year 
to year. If 1933 is to be compared with 1932 a new index 
is needed, in which the prices and quantities for 1933 and 
1932 alone are included. Direct comparison on the basis 
of the values for the ideal index given in the above table 
is liable to error, because of the weighting system employed. 

It is one of the merits of the geometric mean with constant 
weights that it permits the index for each year to be com- 
pared directly not only with the base year index, but with 
the index for any other year. The base may be shifted 
directly from the relatives, and the same result will be 
secured as if the computation were made from the original 
data. If this same system be followed with the ideal index 
no large errors may be expected, but strict accuracy will 
not be secured. 1 


SOME ALTERNATIVE TYPES 

The chief obstacles in the way of general adoption of the 
ideal index arise from the difficulty of obtaining annual or 
monthly quantities to use as weights, and from the time 
involved in its computation. Where accuracy is essential 
the latter is not a serious difficulty. As a substitute formula 
which is much more quickly calculated Fisher has proposed 

S(go + Q&P i 
2(<?o + Qi)Po 

This formula, which has also been recommended by Edge- 
worth and Marshall, is considered by Fisher to be “the 
best practical all-around formula, taking all four points 
into account — accuracy, speed, minimum legitimate cir- 
cular discrepancy, simplicity.” Results from this formula 
will generally differ from those secured from the ideal for- 

* If year to year comparison be a primary aim in a given instance, the ideal 
index may be constructed on the chain system. Link index numbers are first 
constructed, each year serving as base for the computation of the index for the 
succeeding year. These links may then be “ chained” with reference to a fixed 
base. Warren M. Persons has shown that the errors involved in following this 
method are cumulative, and may be serious if the links are chained for a number 
of years. 



Table 56 

Computation of Aggregative Index, Weighted by Combined Quantities 
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57.7 (index for 1920 on 1919 base, in percentage form) 
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mula by less than one fourth of one per cent. Table 56 
on page 205 illustrates the method of computation, data for 
1919 and 1920 being employed. 

This formula requires the same data as the ideal index, 
and these are not generally to be had. Usually it is only 
possible to secure comprehensive quantity figures at each 
census period, and for the intervening years constant weights 
must be employed. In such cases the weighted aggregative 

Zp igo 
Spotfo 

is probably the most generally useful type. The weighted 
geometric has many virtues, but is subject to a definite 
weighting bias. If no weights can be secured, or even ap- 
proximated, the simple geometric and the simple median 
are far better than any of the other simple types. The 
geometric mean is more generally useful than the median. 

An index number of prices is always based upon the study 
of a sample, the result being taken as representative of the 
entire field of prices from which the particular sample was 
drawn. Some method is needed, therefore, by which we may 
judge of the reliability of the different types of index num- 
bers, of their probable stability when computed from a 
number of successive samples. Some differences might be 
expected between index numbers based upon different sam- 
ples. With which type of index number would these differ- 
ences due to fluctuations of sampling be least? 1 

Truman L. Kelley 2 has attempted to measure the prob- 
able errors of the chief types of index numbers and has 
graded these types on the basis of excellence in this respect. 
Two index numbers, the weighted geometric mean and the 
weighted median, are given the highest grade, as being the 
most reliable, the least affected by fluctuations of sampling. 

1 The subject of sampling, in relation to the reliability of statistical measures, 
is discussed in greater detail below. 

* Truman L. Kelley, Statistical Method , New York, Macmillan, 1921* 
834-346. 
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Fisher’s ideal index is ranked somewhat lower, though above 
the weighted arithmetic and harmonic averages of price 
relatives. The simple unweighted arithmetic average of 
relatives is given the lowest rating in the list. 

For reliability, flexibility, and general excellence Kelley 
selects the weighted geometric mean as the best type of 
price index number. A ratio of aggregates 

Xpiw 

2poW 

with selected weights (not necessarily precisely equal to the 
quantities marketed or consumed) is given a total score, 
based on the essential requirements of a good index number, 
as high as that of the weighted geometric mean and higher 
than that of the ideal index. Weights other than actual 
quantities are used in order that there may be flexibility 
in the matter of weighting. 

The detailed discussion of procedures in the preceding 
pages has clearly shown that there are some definitely faulty 
formulas, obviously unsuited for use in the construction of 
index numbers serving ordinary purposes. Among the better 
formulas there are some differences in respect of liability 
to bias and character of data needed, and some variations in 
sampling reliability. The maker of index numbers will have 
these in mind in choosing a formula to employ under given 
conditions. A more important factor in his choice, however, 
will be the purpose to be served by the index number, the 
question it is designed to answer. A weighted aggregate of 
actual prices answers one question definitively. It gives, 
without equivocation, the aggregate cost of a fixed bill of 
goods at one period, in relation to the cost of the same bill 
of goods at another. A geometric mean of relative prices 
answers another question. It measures with accuracy the 
average ratio of the prices of given commodities at one period 
to corresponding prices at another period. Some questions 
(for example, that answered by an unweighted arithmetic 
average of relative prices) have little if any economic sig- 
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nificance. It is because one or two main questions have 
bulked large in economic discussion that emphasis has been 
placed upon the finding of a “best” type of index number. Yet 
the terms “best ” and “ideal ” are unfortunate, for they imply 
that some absolute standard exists, with reference to which 
all formulas may be tested. No such absolute criterion may 
be applied to the diversity of research problems that call 
for the construction of index numbers. On the basis of his 
knowledge of the characteristics of different formulas, the 
discriminating investigator will choose technical methods 
adapted to his data and appropriate to his purposes. 

Other Problems Involved in the Construction 
of Price Index Numbers 

The preceding section has dealt with the technical prob- 
lems connected with the averaging of a given set of data 
in order to secure an index number of price variations. 
Certain methods have been shown to be quite faulty, while 
certain others have been found to be appropriate for given 
purposes. One who would use index numbers with intelli- 
gence should understand fully the methods which have 
been employed in securing given results, in order that he 
may know precisely what the given figure is designed to 
measure and what degree of reliability attaches to it. 

Such problems as these are not the only ones which 
confront those who construct index numbers, nor are these 
considerations the only ones which users of index numbers 
should bear in mind. Of equal importance with problems 
of averaging and weighting are the practical questions con- 
nected with the selection of representative samples. The 
only completely accurate measure of the general level of 
commodity prices would be secured by determining the ratio 
between all money units, including credit, in circulation 
(account being taken of velocity of circulation) and all the 
physical units of goods exchanged for money over a given 
period. The rqeasurement of general price changes between 
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two periods would thus involve complete knowledge of these 
two factors for each of the two periods. Such knowledge, 
of course, cannot be had, so recourse must be had to the 
method of sampling. And primary importance attaches to 
the number of commodities and the character of the com- 
modities upon the prices of which a given index number is 
based. 


NUMBER OP COMMODITIES TO BE INCLUDED 

Here again we are confronted with a relation that has 
already been mentioned, the relation between methods and 
uses. Decision as to the number of commodities and the 
kinds of commodities to be included in a given case must 
rest upon the purpose for which the index is to be con- 
structed. Assuming that the index number is to serve as a 
measure of general changes in the price level, the ques- 
tion as to the number of commodities to be included may 
be easily answered — the larger the sample the more rep- 
resentative will be the results. The frequency polygon 
based upon a large sample will approach more closely to the 
ideal curve which would represent all price quotations than 
will that based upon a small sample. Thus, as a measure 
of general price changes, more confidence may be placed 
in the Bureau of Labor Statistics index, which is based 
upon 813 price quotations, than in Bradstreet’s, which was 
based upon 96 quotations, though the latter had particular 
virtues of its own. 1 Yet index numbers based upon a small 
number of quotations may not be ruled out as without 
value. Wesley C. Mitchell, whose researches have ma- 
terially increased our knowledge of the price system and of 
the characteristics of index numbers, has compared in detail 
index numbers based upon varying numbers of quotations. 
Unexpected similarities are found. Those constructed from 
a limited number of quotations reflect the broad movements 
of prices in much the same way as do those based upon the 

1 Bradstreet’s index was discontinued at the end of the year 1937. 
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prices of several hundred commodities. In important details 
there are differences, however, differences which may in- 
volve doubt as to the movement of prices in a given year. 
In such cases the index numbers based upon many quota- 
tions must be accepted as more accurate measures of general 
price movements, provided that the commodities included 
be equally representative of the various elements of the price 
system. 

For other purposes, however, index numbers based upon 
a limited number of quotations may be preferable. This 
is particularly true when a "sensitive” index is desired, one 
that will serve as a forecaster of general price movements 
rather than as a precise measure of changes in the general 
price level. Of this type is the Harvard sensitive price index 
based upon quotations on 13 basic commodities (raw ma- 
terials). The purposes of such an index are served by the 
selection of a limited number of commodities the prices of 
which are subject to extreme fluctuations, rather than by 
the inclusion of a great many commodities. Yet the uses 
to which an index of this type may be put are limited. 
The "sluggishness” of the many-commodities index number 
is a sluggishness which inheres in the price system, and 
which must be reflected in a faithful index of general 
prices. 

The question of the number of commodities to be included 
cannot be discussed apart from that of the character of 
these commodities. The representative character of an index 
number rests in part upon the number of price series in- 
cluded, but the nature of these series is of even greater 
importance. For there are highly significant differences in 
the behavior of the prices of different commodity groups. 
These groups of prices, their interrelations, their behavior, 
their relation to the functioning of the economic system 
and to the swings of prosperity and depression, are matters 
of immediate and practical importance to economists and 
business men. 
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PRICE GROUPS IN THE FIELD OF WHOLESALE PRICES 

Since an index number of wholesale prices must rest upon 
sample quotations, the sample must be representative, must 
include commodities whose prices are typical of the various 
elements in the price system. The division into elements 
for this purpose must be based upon the character of the 
price changes peculiar to the different groups. Of the 
groups thus distinguished, the most obvious are those rep- 
resenting different industries. Textile prices and steel prices, 
leather prices and the prices of chemicals are subject to 
different influences. Trade depressions and revivals do not 
affect all industries at the same time or in the same way, 
so that an index of wholesale prices must include quotations 
from all important industrial groups. If preponderant in- 
fluence upon an index is exerted by the prices of certain 
types of commodities, the index, by that much, loses its 
representative character. Thus Bradstreet’s index, it has 
been established, gave greater weight to cotton fabrics, 
hides and leather, and cured meats than was justified by their 
actual importance in trade, a fact which did not detract 
from its utility for some purposes but which lessened its 
value as a representative index of wholesale prices. 

The extent of these differences between the price move- 
ments of commodities in different industrial groups may be 
appreciated by comparison of the index numbers of whole- 
sale prices of grains and metals and metal products during 
the business recession that began in the summer of 1937. 

In order that an index may be representative it is not alone 
sufficient that all industries be given an appropriate number 
of representatives in the sample. Raw materials and man- 
ufactured goods show characteristic differences in their fluc- 
tuations, and fitting representation must be given to each 
of these groups. Prices of the former are, in general, more 
sensitive to changes in business conditions, their movements 
preceding those of manufactured goods and showing more 
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violent fluctuations. There are several reasons for this. 
Raw materials are traded in for purposes of manufacture 
and sale. When business improves after a period of depres- 
sion, increased demand on the part of consumers (or expected 
increase in demand) leads competing manufacturers to bid 
against each other for raw materials. It is in the raw ma- 
terial markets that the pressure of increased demand first 
centers, and this bidding generally causes prices to rise in 
these markets before the prices of other goods are affected. 
Similarly, at the first evidence of slackening trade manu- 
facturers’ demand for raw materials falls off. Business 
forces pure and simple play in the raw material markets 
with more freedom than in the markets for manufactured 
goods. Hence the tendency of prices in these markets to 
anticipate, in their movements, prices in other commodity 
markets. 

Additional reasons for the greater stability of prices of 
manufactured goods are found in the fact that these prices 
include a greater percentage of stable cost factors, and in 
the control over supply exercised by most manufacturers. 
Wages, interests, rents move more slowly and less violently 
than do commodity prices. The inclusion of these elements 
in commodity prices tends to render these prices more stable. 
Therefore, as commodities move forward from the raw stage 
to their final manufactured condition their prices include 
more and more of these stabilizing elements, and become 
less violent in their fluctuations. 1 Control over supply, 
which manufacturers possess in much higher degree than 
primary producers, makes possible the enforcement of defi- 
nite price policies by fabricators. Under these conditions, 
stable prices and variable output are usually found. 

Each of the groups last mentioned contains minor groups 
of commodities with distinct price characteristics. Within 
the raw material group there are marked differences between 

1 Cf. Mitchell, “The Making and Using of Index Numbers,” Bulletin 284 
U. S. Bureau of Labor Statistics, 44-45, for examples. 
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agricultural products, animal products, forest products, and 
mineral products. Agricultural products are affected by 
weather and crop conditions as well as by business conditions 
and, though subject to price fluctuations of some magnitude, 
reflect prevailing business conditions less accurately than do 
the prices of mineral products . 1 Animal and forest products 
appear to stand between these two with respect to the 
faithfulness with which they reflect business conditions in 
their price movements. Thus, in selecting raw materials 
for inclusion in a sample of price quotations from which a 
representative index number is to be constructed, fair weight 
must be given to these various classes . 2 

Manufactured goods, again, do not constitute a single 
homogeneous group with respect to their price movements. 
In so far as they are to be used for further production, or to 
undergo further manufacture, they resemble raw materials 
in relation to the bidding of competing manufacturers, and 
their prices, therefore, are characterized by relatively wide 
oscillations. In so far as the demand for them is for the pur- 
pose of final consumption, purely business forces have less 
weight, and their prices are more stable. Related to this 
argument is that which has already been presented, the 
increasing stability of prices as the stable elements of wages 
and overhead charges bulk larger in commodity costs. So, 
again, the sample price quotations from which an index of 
wholesale prices is to be constructed must include prices 
representative of producers’ and consumers’ goods, of goods 
in the intermediate as well as the final stages of manufacture. 

Other important divisions of the price system exist. The 
behavior of the prices of capital equipment differs from that 
of prices of goods intended for human consumption. The 

1 It should not be inferred from this that there is no relation between agri- 
cultural production and the prices of agricultural products, and general business 
conditions. The immediate price relation is frequently one of contradictory 
movements, and cycles in agricultural production are not synchronous with 
business cycles. But conditions in these two fields of economic activity are 
mutually related in many ways. 

* Cf. “The Making and Using of Index Numbers/’ 47. 
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prices of durable goods differ in their fluctuations from the 
prices of perishable goods. Goods imported into a given 
country and goods exported from that country are usually 
subject to the play of different forces. A representative 
index number of wholesale prices should be based upon price 
quotations drawn from all commodity groups marked by 
distinctive modes of behavior, with weight given to each in 
proportion to the relative importance in trade of the com- 
modities in that category. 

Price Comparisons over Time 

In the opening pages of this chapter the fact was noted 
that the degree of dispersion found in frequency distributions 
of price relatives depended upon the length of time covered 
in price comparisons. Hence, on statistical grounds, there is 
justification for the conclusion that the accuracy of well- 
constructed price indices is high for measurements extending 
over a short interval, and becomes progressively lower as 
the range of the time comparison increases. This conclusion 
is supported by other considerations. 

In Laspeyres’ formula, 

I = 

2po?o 

the price factor alone varies, as between numerator and de- 
nominator. The constant weighting factor, q 0 , is assumed to 
define quantities entering into trade in an unchanging system 
of income distribution, living standards, consumption habits, 
etc. This system, for which Sir George Knibbs has used 
the term “regimen,” is taken to be common to the two 
periods compared. If it is constant, and if the q’a which 
define its quantitative attributes are unchanged, then we may 
expect to measure with accuracy the one factor which does 
change — commodity prices. The condition we have here as- 
sumed is the orthodox one of ceteris paribus, the condition that 
factors other than the one subject to study remain constant. 

In fact, of course, the regimen does not remain fixed. 
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Changes in tastes and in consumption habits occur; changes 
in types of goods used as capital equipment take place; 
incomes shift, and the flow of goods is altered by changes in 
the distribution of buying power among consuming groups; 
the very price changes that we seek to measure bring altera- 
tions in the demand for given types of goods, and in the quan- 
tities produced. Of no small moment in the total situation 
are the changes that occur in the quality of goods that con- 
tinue to pass by the same trade names. The automobile of 
1938 is the same commodity, by name, as the automobile 
of 1910, but to the average consumer the later model repre- 
sents quite a different bundle of utilities. Similarly, steel, 
textiles, locomotives, even the staple articles of diet have 
undergone important quality changes. A comparison of 
price levels in 1910 and 1938 that depends for its accuracy 
on the assumption that all elements of economic life except 
prices have remained constant is suspect, indeed. 

Our difficulties are not removed if we take as the standard 
of reference the regimen of the second of the two periods 
compared. This is done in Paasche’s formula, 

/ = ?Ml. 

The system of consumption standards and all that goes 
with it may be of modern vintage in this case, but the 
differences between the regimens of the two periods com- 
pared is just as wide. We have not held constant non-price 
factors, and our measurement of price changes loses in 
accuracy, as a result. 

The method exemplified by the Ideal formula, that of 
employing weighting factors drawn from both periods, rep- 
resents one attempt at the solution of this problem, but it 
is far from perfect. The use of quantities drawn from the 
two regimens does not create a common regimen, the indis- 
pensable condition of full accuracy in such comparisons. 

The practical procedure in the face of this difficulty is to 
restrict our comparisons, if high accuracy is required, to 
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periods not widely separated in time. Consumption habits, 
living standards and technical production methods will be 
not widely dissimilar in two such periods, and hence the 
number of identical commodities common to the two periods 
will be large. Under these conditions considerable confi- 
dence may be placed in index numbers measuring average 
price changes. Comparison of price levels over longer periods 
may be desired, and may be justified, but the margin of error 
in the measurements may be expected to increase as the time 
span extends. Formal precision in weighting and in the selec- 
tion of acceptable formulas will not provide an escape from 
the unavoidable difficulties arising out of alterations in the 
basic conditions of economic life. Real continuity of in- 
dices covering a stretch of years is possible only on the 
basis of a persisting common regimen. 

These considerations support the claims of an index of 
the chain type, which involves the measurement of price 
changes between successive periods not far apart in time. 
Bruce D. Mudgett has advocated this procedure. The com- 
parison of price levels in two periods, close together in time 
and with similar regimens, will be accurate, if such an index 
as the Ideal be employed. The elements of such a chain 
may then be linked together, in attempting to measure 
price changes between non-consecutive periods. If the regi- 
mens of the non-consecutive periods differ materially, the 
accuracy of the comparison will probably not be high. But 
it is reasonable to believe that better results will be secured 
by bridging the intervening years in the manner proposed 
than by constructing a single far-flung index based only 
upon the widely dissimilar regimens of two periods far 
removed in time. 

The Wholesale Price Index of the United States 
Bureau of Labor Statistics 

The authoritative index of wholesale prices in the United 
States is that compiled by the United States Bureau of Labor 
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Statistics. This index was first constructed in 1902, for the 
period beginning with 1890. It was continued until 1913 
as an unweighted average of relative prices, the base of 
each relative being the average price of the given commodity 
during the ten-year period 1890-1899. Various revisions of 
procedure have been made since 1913. As it stands at pres- 
ent the index for any given period (week, month, or year) 
is a weighted aggregate of actual prices, the aggregate being 
expressed, to facilitate comparison, as a relative with 1926 
as the base. 

The index now includes 813 price series. (A single com- 
modity may be represented by several quotations, the prices 
for different grades or in different markets being given. 
Thus for raw cotton there are three quotations, Middling, 
New Orleans; Middling Upland, New York; and Middling 
Upland, Galveston.) In the' derivation of the aggregate for 
any date each price quotation is multiplied by a given weight, 
known as a “quantity weight” or a “multiplier.” This 
same weight is applied to the price quotation for the base 
period. The cross products thus obtained for the base period 
and the date in question are values of a stated quantity of 
goods; they differ only in respect of the price factor. The 
following tabulation illustrates the method as applied to 
cotton: 

Commodity 


Cotton, Middling, 

New Orleans 
Cotton, Middling 
Upland, N. Y. 

Cotton, Middling 

Upland, Galveston .077 6,297,729,000 484,925,133 

When this process is carried out for the entire 813 price 
series included, the sum of the values in the last column 
gives the index number for the given period, in this case 


Average price, 
November, 1937 
(per pound) 

Pk 

$.080 

.080 


Quantity 

weight 

(pounds) 

Ik 

1,399,496,000 

77,750,000 


Average price, 
November, 1937 
X quantity 

weight 

PkQh 

$111,959,680 

6,220,000 
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November, 1937. As published, this sum is expressed as a 
relative, the aggregate in 1926 representing 100. 1 The 
formula for the index measuring the level of wholesale prices 
at time “1,” with reference to the base level at time “0” is, 
thus, 

Pi _ Spiff* 

Pa Spo?* 

where g* represents the constant multipliers. The method 
of construction renders it possible to shift the base to any 
desired year or month, changing the given relatives to per- 
centages on the new base. 

This index number, therefore, is based upon the cost at 
wholesale of a bill of goods. The bill of goods remaining 
the same, the total cost changes as the prices of the various 
commodities change, and the index measures the effect of 
these changing individual prices upon the total cost. 

It is essential, of course, that the quantity used as mul- 
tiplier for each series of price quotations truly represent 
the relative importance of the commodity in question. The 
multipliers employed are approximations to the quantities 
actually marketed. Changes are made from time to time in 
these quantities, the revisions being applied, of course, to 
the base period aggregates as well as to the figures for later 
periods. In addition, when it is necessary to substitute one 
price series for a related one that has been discontinued or 
has lost significance, minor modifications are made in the 
multipliers so as to maintain comparability between the 
aggregates for periods preceding and periods following the 
date of substitution. 2 

The Bureau of Labor Statistics publishes index numbers 
of wholesale prices for 10 major and 45 minor commodity 

1 This base is, at the date of writing, twelve years removed in time. Adop- 
tion of a 1935-1937 base period is now being considered. 

2 The method of adjustment is explained in an article, " Revised Method of 
Calculation of the B. L. S. Wholesale Price Index,” by Jesse M. Cutts and 
Samuel J. Dennis, Journal of the American Statistical Association , December, 
1937, 663-674. 
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groups, as well as a general index for all commodities. The 
major groups include farm products, foods, hides and leather 
products, textile products, fuel and lighting, metals and metal 
products, building materials, chemicals and drugs, house fur- 
nishing goods, and miscellaneous commodities. The con- 
stituent elements of the index are also classified into raw 
materials, semi-manufactured articles and finished products, 
and measurements of price changes for these groups are 
computed. The National Bureau of Economic Research has 
constructed index numbers for various other categories of 
commodities, utilizing the quotations of the Bureau of Labor 
Statistics. These classes include raw and processed goods, 
durable and non-durable goods, producers’ goods and con- 
sumers’ goods, goods destined for use in capital equipment 
and goods destined for human consumption, foods and non- 
foods, and crops, animal products, minerals, and forest 
products . 1 The availability of index numbers for various 
significant classes of goods makes it possible to trace price 
changes with more precision, and to interpret them more 
accurately, than when dependence is placed upon a single 
all-embracing index. For the elements of the price system 
are marked by wide diversity in their behavior over both 
long and short periods of time. 

Other Price Index Numbers 

The measurement of price changes by the use of index 
numbers has not been confined to wholesale prices. Many 
variations of this device have been utilized in measuring 
price movements in other fields. It will be useful at this 
point briefly to indicate the character of some of these 
variations . 2 

1 See Prices in Recession and Recovery, N. Y., National Bureau of Economic 
Research, 1936, 492-540. 

* Detailed information concerning the character and content of a wide variety 
of index numbers, price and other, will be found in An Index to Business Indices t 
Donald H. Davenport and Frances V. Scott, Chicago, Business Publications } 
Inc., 1937. 
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INDEX NUMBERS OP RETAIL PRICES 

An index of retail food prices is published currently by 
the United States Bureau of Labor Statistics. The general 
methods employed are similar to those already explained in 
connection with the index of wholesale prices computed by 
that agency, with such differences as inevitably result from 
the nature of the material. 

Actual retail selling prices of 84 articles of food are secured 
biweekly from dealers in 51 representative cities throughout 
the United States. In weighting the quotations on foods 
of a single type (fresh vegetables, for example) in a given 
city, account is taken of the quantities of such foods con- 
sumed by an average wage-earner’s family in that city or, 
for some regions, in the district in which that city lies. In 
obtaining weights consumption by food groups is considered, 
rather than by specific commodities, since the commodities 
actually priced must be taken to represent similar commodi- 
ties for which no prices are collected. 

The combination for a single city (or geographical area) 
of food prices thus weighted yields an index for that regicn. 
The food cost index for the United States is computed from 
the aggregates for the 51 cities, each weighted according to 
the population of the area which the city is taken to repre- 
sent. Thus the weights entering the final index of retail 
food prices for the country as a whole represent quantities 
consumed by the average wage-earner’s family, and the 
population assumed to be affected by each series of quoted 
prices. The base of the index numbers, as published, is the 
average of the three-year period 1923-1925. 

The indices of retail food prices, together with index 
numbers of the prices of electricity and coal, at retail, are 
published in the Monthly Labor Review. 

The difficulties inherent in the problem of measuring 
wholesale price movements have been discussed at some 
length. The construction of index numbers of retail prices 
of the type just described presents even greater difficulties. 
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All the theoretical problem? arising in the former case are 
to be solved and, in addition, the practical difficulties of 
securing suitable weights, accurate price figures, and com- 
parable quotations are intensified. Because of the lack of com- 
modity standardization, and because of variations in business 
practice and local customs, the latter difficulty is particularly 
acute. For these reasons no index of retail prices at present 
published can be accepted with the confidence with which 
the best indices of wholesale prices may be received. 

INDEX NUMBERS OF THE COST OF LIVING 

If these problems are acute in constructing an index of 
retail prices they are doubly hard to solve in measuring 
such an entity as the cost of living. When food prices, 
rents, retail clothing prices, cost of fuel and light, retail 
furniture prices, and the prices of the other miscellaneous 
items which are included in the budget of the average family 
are to be averaged, and an index number constructed to 
measure variations in the cost of these items, numerous 
statistical difficulties must be overcome. Theoretical ques- 
tions concerning the most suitable methods of averaging 
and weighting present themselves, but more important are 
the practical problems involved in the collection of accurate 
and comprehensive prices and weighting data. 

Two index numbers of the cost of living are currently 
compiled in the United States, one by the Bureau of Labor 
Statistics, one by the National Industrial Conference Board 
of New York. The former appears in the Monthly Labor 
Review, the latter in periodic publications of the Conference 
Board. In each case the chief items of domestic expenditure 
are weighted in accordance with their relative importance in 
household budgets, and the combined results expressed as 
relative numbers. These are given on the 1913 and 1923-1925 
base by the Bureau of Labor Statistics, on the 1923 base 
by the Conference Board. 1 

1 For a general discussion of the problem, with details of the Conference 
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INDEX NUMBERS OP PRICE AND BUYING POWER OP 
FARM PRODUCTS 

A set of useful index numbers relating to the prices re- 
ceived by and the prices paid by farmers is compiled by the 
United States Department of Agriculture. The first of these 
is based upon the prices at the farm, as of the middle of 
each month, of 34 major farm products and 13 commercial 
truck crops. The weights employed are the average quan- 
tities marketed by farmers during the period 1924-1929. 
Farmers and agricultural economists have need of such a 
specialized index, because the wholesale prices of farm prod- 
ucts in the great exchanges or in large cities are often poor 
representatives of the prices actually received by farmers. 

The index of prices paid by farmers is compiled quarterly 
(in March, June, September, and December). The constitu- 
ent quotations are retail prices paid by farmers for commod- 
ities used in family maintenance and in production. Weights 
are estimated quantities bought by farmers. The base of 
the index of farm prices, as published, is the average of the 
five pre-war years from August, 1909 to July, 1914; that of 
the index of prices paid, 1910-1914. Measurements for sub- 
groups are given, in both cases. 

These two index numbers are used in the derivation of an 
index of the purchasing power of farm products. The com- 
putation of the purchasing power index may be illustrated 
with reference to the figures for 1936. In that year the index 
of prices of farm products was 114. The index of prices paid 
by farmers was 124. That is, the farmer was receiving 
14 per cent more, on the average, for a unit of product than 
in 1909-1914, but the average price paid by him for a unit of 
goods purchased was 24 per cent higher than in the base 
period. Therefore the purchasing power of an average unit 
of farm products was 8 per cent less than in 1909-1914 
(114+ 124=. 92). 

Board procedure, see Cost of Living in the United States , 1914-1986, M. Ada 
Beney, New York, National Industrial Conference Board, 1936. 
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These three index numbers, for selected years, are given 
in Table 57. 

Table 57 

Index Numbers of Farm Prices, Prices Paid by Farmers, and the 
Buying Power of Farm Products 1 


(1) 

(2) 

(3) 

(4) 

Average per unit 

Year 

Prices received 
by farmers 

Prices paid 
by farmers 

'purchasing power 
of farm products 

(2) -5- (3) 

1910-1914 

100* 

100 

100 

1918 

202 

176 

115 

1920 

211 

201 

105 

1921 

125 

152 

82 

1925 

156 

157 

99 

1929 

146 

153 

95 

1932 

65 

107 

61 

1933 

70 

109 

64 

1934 

90 

123 

73 

1935 

108 

125 

86 

i936 

114 

124 

92 

1937 

121 

131 

93 


* Aug., 1909-July, 1914 = 100. 

These are significant measurements, yielding valuable in- 
formation concerning the buying and selling relations that 
vitally affect one important group of producers. The devel- 
opment of similar measurements for other groups will add 
materially to our understanding of the changes that shifting 
market relations entail, in the economy at large. Yet the 
limitations of these index numbers should not be overlooked. 
The measurement of prices paid by farmers and, correspond- 
ingly, the measurement of the purchasing power of farm 
products, are subject to the difficulties referred to in the 
discussion of retail prices and living costs. Under existing 
conditions the margin of error in all such measurements is 
fairly wide. The error is the greater, too, the longer the 
time span covered by the quotations. In the present case, 
goods bought by farmers have undergone greater changes 

1 Source: The Agricultural Situation, U. S. Bureau of Agricultural Economics. 
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in quality than have the fairly standardized staples that 
the farmer sells. Here, as in all price comparisons over time, 
greater confidence must attach to short-period comparisons 
than to those spanning several decades. 
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CHAPTER VII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT 

OF TREND 

The preceding sections have dealt primarily with frequency 
series and with problems arising in the attempt to organize 
and describe such series. We are now concerned with 
data in the study of which the essential problem is the 
analysis of chronological variations. Such series are of 
major importance in the field of economic statistics, for 
most of the data of economics and business are variables in 
time — as bank clearings, steel production, volume of sales, 
etc. This dominating importance of series in time is not 
found in any other field of statistical research, and the 
development of methods of analysis appropriate to time 
series has come, accordingly, only within recent years with 
the wider adoption of statistical methods in the field of 
economics. 

Problems connected with time series arise both in the 
ordinary routine of internal administration and in the 
analysis of general economic conditions. Sales, purchases, 
profits on the one hand, stock prices, interest rates, business 
failures on the other, are variables which fluctuate with the 
passage of time. In the analysis of such series it is generally 
desired that the rate and character of growth be determined, 
and that periodic and accidental fluctuations be isolated 
for study. The sales manager wishes to know how the vol- 
ume of sales is faring, when and why it fluctuates and how 
it compares with volume of production. The economist 
desires to trace the trend of prices, and to scrutinize minutely 
the upward and downward movements of the price level. 
The making of business plans on even a small scale, as 
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well as the most elaborate schemes of economic forecast- 
ing, must rest upon such study of past trends and fluctua- 
tions, and upon comparison of the movements of related 
series in time. Scientific study of the business cycle is only 
possible through the application of such methods. Our 
present task is the development of methods appropriate to 
the analysis of series in time. 

The Preliminary Organization of Time Series 

The data of time series usually require less preliminary 
organization than do statistical data which are to be reduced 
to the form of a frequency distribution. The source, pri- 
mary or secondary, from which the figures are taken usually 
presents them in shape for analysis. Certain precautions 
should be observed, however. 

The dates to which the figures apply should be clearly 
understood and definitely stated. Monthly data may be 
as of the first of each month (as for the stock price index 
of the New York Stock Exchange), averages for each month 
(as for the Bureau of Labor Statistics’ price index), or 
totals for each month (as in the case of figures on cotton 
consumption). They may be cumulative monthly figures, 
each item representing the total for the year to date, as 
in the case of certain coal production data. If average 
figures are given for a month or year it is important to 
know how the average has been secured. 

Again, it is essential that in any time series there be 
strict comparability between data for different periods. 
Any attempt to analyze a series that is not homogeneous 
must be misleading and futile. Yet such series are not 
infrequently published. Commodity production or con- 
sumption figures published by trade associations and by 
governmental agencies are sometimes based upon returns 
from a varying number of reporting concerns. A series 
of price quotations may lack comparability as between 
different dates because of changes in the unit or grade to 
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which the quotations apply, or because quotations are 
drawn from different markets. Changes in census classifica- 
tions may result in lack of comparability of census data. 
A change in a salesman’s territory may alter his returns 
materially. It is stated that the character of the obligations 
represented by the United States Steel Corporation’s figures 
for “unfilled orders” has varied from time to time. Records 
relating to the physical output of a given icommodity in 
different periods may be rendered inaccurate by changes 
in quality or design. These are examples of faults that 
may be found in time series, rendering analysis futile. 
Strict testing is essential before a series be accepted as 
accurate and homogeneous. 

GRAPHIC REPRESENTATION OP TIME SERIES 

Normally the first step to be taken in visualizing a series 
in time and in preparing for further analysis consists of 
plotting the data. The trend and general characteristics 
of a series may be most readily apprehended through 
graphic representation. The data may be plotted on ordinary 
arithmetic or on semi-logarithmic paper. The advantages 
of the latter types for certain purposes have already been 
explained. The choice in a given case will depend upon the 
nature of the data and the object of the study. If interest 
lies in the absolute amount of fluctuations in sales, prices, 
pig iron production or whatever may be in process of 
analysis, or in the comparison of absolute differences between 
series, the ordinary rectilinear chart is to be employed. 
If percentage variations and the comparison of relative 
fluctuations are matters of interest, the semi-logarithmiC 
representation is preferable. In general, if one is accustomed 
to the interpretation of this latter type of chart, its use js 
advisable. A clearer, less-distorted presentation of relations 
and a more significant comparison of series are generally 
secured when economic data having time as one variable 
are plotted on paper with a logarithmic ruling on one axis. 
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For some purposes the process of studying series in time 
will have been completed when the data are thus plotted. 
The general trend may be roughly determined from the 
chart. The existence of seasonal and other periodic varia- 
tions may be ascertained. Rough comparisons of trends 
and fluctuations may be made. All the knowledge thus 
secured, it should be noted, will be non-quantitative in 
character, and the comparisons will be tentative and 
approximate. Even so, such charts enable trends and 
relations to be much more clearly visualized than do the 
raw figures, and for some purposes the knowledge thus 
secured is sufficient, though it lacks precision and accuracy. 
For other purposes more exact measurement and more 
refined analysis are required. Certain appropriate methods 
may be described. 

Forces Affecting Series in Time 

The general object in the analysis of a time series is the 
isolation of the effects of one or more of the forces affecting 
the given series. This may be desired in order that the 
past behavior of the single series may be understood, in 
order that the future behavior of the series may be pre- 
dicted, or in order that two or more series may be compared. 
It is not in any case possible to isolate these effects of 
individual forces with absolute accuracy, and in some cases 
it is impossible even to approximate such a result. But 
given figures covering a sufficiently long period, the effects 
of various influences upon the behavior of a given series 
may usually be measured with some degree of accu- 
racy. 

What are these forces that affect series of data in time? 
The forces in any given case may be unique, affecting only 
the given series, but in general the various influences acting 
upon such series may be placed in a limited number of 
categories. 
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SECULAR TREND 

In the first place, most series of economic statistics exhibit 
definite trends. Such a trend may be constant in direction, 
may change direction at a constant rate, or may even be 
characterized by abrupt shifts in direction or rate that 
reflect the introduction of novel elements. Thus the volume 
of production or sales of a business house over a period 
of years usually shows a fairly regular growth. The same 
is true of population, the production of basic minerals, 
the number of motor vehicles registered, etc. In some 
cases the rate of growth may be a negative one, as is true 
of interest rates in the United States over the last half 
century. The concept of secular trend (i.e., trend over a 
long period of time) covers both positive and negative 
changes of this type. 

In the analysis of a time series the trend value at any 
date is taken to be the “normal” value at that date. T his 
conception of a normal value which may be used as a base 
or point of reference in judging the effects of all forces other 
than the growth factor, is fundamental in economic analy- 
sis. “No other method,” says Carl Snyder, “enables us so 
quickly to set economic events in their just perspective.” We 
should note, however, that such a normal value is essen- 
tially an empirical construction. While useful for purposes 
of reference, and as one of a series of measurements reflect- 
ing secular movements in a given series, it should not be 
assumed to possess any special normative significance. 

The fact should be emphasized that by secular trend is 
meant the smooth, regular, long-term movement of a statis- 
tical series. Frequent and sudden changes either in absolute 
amounts or in rates of increase or decrease are quite incon- 
sistent with the idea of secular trend. It is true that there 
may be occasional changes due to the interjection of a 
new element or the withdrawal of an old factor. But the 
breaking up into numerous sections of the period covered 
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by a time series, and the determination of trend for each 
of these minor periods, does violence to the very concept 
of secular change. 

It does not follow from this discussion that a definite 
rising or declining trend exists for all time series. Many 
series, such as barometric readings at a certain point, 
merely fluctuate about a constant level that does not change 
with the passage of time. 

PERIODIC FLUCTUATIONS 

If the plotted representation of a time series be studied, 
the long-term trend may be discerned in the general upward 
or downward drift, but may not be precisely determined 
by inspection because of the existence of numerous fluctua- 
tions, superimposed upon the trend. These fluctuations 
may be regular or irregular, violent or mild, simple or 
complex. The value of the variable at any given date 
may be thought of as the net resultant of the interaction 
of the secular trend and the various forces that tend to 
modify the persistent secular movements of a given series. 
(It may be, in fact, that for many series the trend is the re- 
sultant of the interplay of a variety of conflicting forces, 
rather than an underlying movement upon which the peri- 
odic and other fluctuations are superimposed. In the present 
discussion no attempt is made to define the organic relations 
that may exist among the forces affecting a series in time.) 
These latter forces may be of several types. 

Seasonal variations are found in most scries of economic 
statistics for which quarterly, monthly, or weekly values 
are obtainable. Consumption and production of commodi- 
ties, interest rates, bank clearings, railroad freight traffic, 
and many other types of data are marked by seasonal 
swings repeated with minor variations year after year. 
These, in so far as they exist at all, are definitely periodic 
in character, with a constant twelve-month period. Less 
markedly periodic, but nevertheless characterized by a 
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considerable degree of regularity, are the cyclical fluctuations 
that are found in series affected by forces connected with 
economic or business cycles. Prices, wages, the volume 
of industrial production, trading on the Stock Exchange, 
and most series relating to the activities of individual 
business units are affected by the swings of business through 
alternating periods of depression and prosperity. While 
the length of such periods may vary, the general sequence 
of change has been in the past sufficiently regular to render 
these cyclical movements capable of study. 

RANDOM FLUCTUATIONS 

Entangled with these more or less regular movements 
are the effects of random, accidental, and irregular fluctua- 
tions — catastrophic events such as the San Francisco earth- 
quake, wars, floods, fires, and countless minor events equally 
fortuitous though less violent in the resulting disruptions. 
These events influence the value of a variable at a given 
date, modifying the effects of long-term movements and of 
seasonal and cyclical factors. The observed value at any 
time is the resultant of the play of all these forces. 

The analysis of series in time involves the isolation of 
the effects of these various forces, so far as this is possible. 
A problem may call for the study of but one factor, or it 
may require the complete breaking up of given values. 
When annual data are used the seasonal element will not 
enter, of course. The explanation of methods begins with 
a consideration of problems involving only this type of data. 

The Measurement of Secular Trend 

As an example of the type of material in connection with 
which these problems arise, the figures in Table 58 on page 
232 may be taken. The values are given in thousands of 
millions in order to simplify the calculations. 

As has been pointed out, the figure for any year, as the 
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Table 58 

New York Clearing House Transactions, 1875-1936 


(In thousands of millions) 


1875 

25.1 

1891 

34.1 

1907 

95.3 

1923 

214.6 

1876 

21 6 

1892 

36.3 

1908 

73.6 

1924 

235.5 

1877 

23.3 

1893 

34.4 

1909 

99.3 

1925 

276.9 

1878 

22.5 

1894 

24.2 

1910 

102.6 

1926 

293.4 

1879 

25.2 

1895 

28.3 

1911 

92.4 

1927 

307.2 

1880 

37.2 

1896 

29.4 

1912 

96.7 

1928 

368.9 

1881 

48.6 

1897 

31.3 

1913 

98.1 

1929 

456.9 

1882 

46.6 

1898 

39.9 

1914 

89.8 

1930 

399.5 

1883 

40.3 

1899 

57.4 

1915 

90.8 

1931 

287.7 

1884 

34.1 

1900 

52.0 

1916 

147.2 

1932 

177.3 

1885 

25.3 

1901 

77.0 

1917 

181.5 

1933 

154.6 

1886 

33.4 

1902 

74.8 

1918 

174.5 

1934 

162.7 

1887 

34.9 

1903 

70.8 

1919 

214.7 

1935 

174.4 

1888 

30.9 

1904 

59.7 

1920 

252.3 

1936 

186.5 

1889 

34.8 

1905 

91.9 

1921 

204.1 



1890 

37.7 

1906 

103.8 

1922 

213.3 




value of $162.7 thousands of millions for 1934, is the net 
resultant of the many forces that we have classified under 
the headings of secular trend, cyclical variations, and ran- 
dom or accidental fluctuations. Our first problem is to 
measure the secular trend. 

In Fig. 54 the data of New York bank clearings during 
the period 1875-1936, inclusive, have been plotted. A 
definite trend is apparent, together with well marked and 
more or less regular deviations from that trend. Several 
methods are available for arriving at approximations to 
this trend. By employing moving averages an attempt may 
be made to eliminate passing fluctuations and to arrive 
at values that define the influence of the steadily operating 
growth factor. If we assume that a definite functional 
relationship prevails (empirically at least) between the time 
factor and the other variable, an approximation to the 
trend may be secured by fitting an appropriate curve to 
the plotted data. Smoothing the data by hand gives some- 
what the same result, the curve being frankly approxi- 
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986r 
9861 
*861 
SCSI 
3861 
1861 
0861 
6861 
8861 
£861 
9861 
9861 
*861 
8861 
8861 
1861 
0861 
6161 
8161 
£X6X 
9X61 
8161 
*161 
8161 
8161 
X 161 
0X6X 
606 X 
806 X 
£06 X 
9061 
S06X 
*06 X 
806 X 
806 X 
1061 
006 X 
668 X 
868 X 
£68 X 
968 X 
968 X 
*68 X 
868X 
868X 
I68X 
068X 
6881 
888 X 
£88 X 
9881 
988X 
*88X 
888 X 
888X 
X88X 
088 X 
6£8X 
8£8X 
££8X 
9£8I 
S£8t 




— New York Clearing House Transactions, 1875-1936, with Moving Averages 
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mative and empirical in character. In certain studies it 
has been found possible to use one statistical series as base 
or trend line for another series of homogeneous data. 

Moving Averages 

When a trend is to be determined by the method of 
moving averages, the average value for a number of years 
(or months, or weeks) is secured, and this average is taken 
as the normal or trend value for the unit of time falling 


Table 59 

New York Clearing House Transactions, 1912 - 1936 , and 3 -, 5 -, 7 -, 
and 9-Year Moving Averages 
(In thousands of millions) 


Year 

Original 

Three-year 

data 

moving av. 

1912 

$ 96.7 


1913 

98.1 

$ 94.87 

1914 

89.8 

92.90 

1915 

90.8 

109.27 

1916 

147.2 

139.83 

1917 

181.5 

167.73 

1918 

174.5 

190.23 

1919 

214.7 

213.83 

1920 

252.3 

223.70 

1921 

204.1 

223.23 

1922 

213.3 

210.67 

1923 

214.6 

221 . 13 

1924 

235.5 

242.33 

1925 

276.9 

268.60 

1926 

293.4 

292.50 

1927 

307.2 

323.17 

1928 

368.9 

377.67 

1929 

456.9 

408.43 

1930 

399.5 

381.37 

1931 

287.7 

288.17 

1932 

177.3 

206.53 

1933 

154.6 

164.87 

1934 

162.7 

163 90 

1935 

174.4 

174.53 

1936 

* 86. 5 



Five-year 

Seven-year 

Nine-year 

moving av. 

moving av. 

moving av. 


$ 104.52 



121.48 

$ 125.51 


136.76 

142.37 

$ 149.51 

161.74 

164.40 

161.44 

194.04 

180.73 

174.24 

205.42 

198.23 

188.11 

211.78 

207.86 

204.19 

219.80 

215.57 

218.60 

223.96 

230.20 

231.03 

228.88 

241.44 

245.78 

246.74 

249.29 

262.91 

265.52 

272.83 

285.64 

296.38 

307.63 

307.36 

340.66 

334.04 

315.62 

365.18 

341.50 

311.48 

364.04 

327.27 

302.49 

338.06 

307.44 

289.80 

295.20 

286.80 

276.58 

236.36 

259.01 

263.17 

191.34 

220.39 


171.10 
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at the middle of the period covered in the calculation of 
the average. Table 59 shows the results secured when three-, 
five-, seven-, and nine-year moving averages are thus 
computed for New York clearings for the period 1912-1936. 

The three-year moving average for 1916 is the average 
of the figures for 1915-16-17, the five-year figure for 1916 
is the average of the years 1914-15-16-17-18. The other 
averages are computed in the same way. In each case the 
average is centered for the period included; that is, the 
average is taken to represent normal as of the middle 
of the given period. The employment of an odd number 
of years simplifies this centering process, though it is not 
essential that the number be odd. With an even number 
of years, the figure may be centered by taking a two-year 
moving average of the first moving average. The three- 
and nine-year moving averages for the entire period are 
plotted, with the original data, in Fig. 54. 

It is obvious that the effect of the averaging is to give a 
smoother curve, lessening the influence of the fluctuations 
that pull the annual figures away from the general trend. 
The longer the period included in securing each average, 
the smoother is the curve secured, though there are other 
factors to consider in deciding upon the length of the period. 
Certain of these factors may be noted. 

CHARACTERISTICS OF MOVING AVERAGES 

Given cyclical fluctuations about a uniform level or about 
a line ascending with a uniform slope, the length of the 
cycle and the magnitude of the fluctuations being constant, 
a moving average having a period equal to the period of 
the cycle (or to a multiple of that period) will give a straight 
line, a perfect representation of the trend. Under the 
same conditions a moving average having a period greater 
or less than the period of the cycle will give, not a straight 
line, but a new cycle having the same period as the original, 
but with fluctuations of less magnitude. The minima and 
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maxima of the cycles thus obtained will not necessarily 
coincide with the minima and maxima of the original cycles. 
In general, when such a new cycle is obtained the magnitude 
of the fluctuations will be less the longer the period on 
which the average is based. 1 

These propositions may be illustrated by the figures in 
Table 60, arbitrarily chosen. In the first example five 
figures have been selected which repeat themselves in 
sequence, fluctuating about a common level. 

The moving averages in columns (2) and (3) represent 

Table 60 


Illustrating the Application of Moving Averages 


(1) 

(2) 

(3) 

(4) 

(5) 

Cyclical Moving average 
data of 5 items 

Moving average 
of 10 items 
{centered) 

Moving average 
of 3 items 

Moving average 
of 8 items 
(centered) 

2 





6 



51 


8 

61 


8 


10 

61 


71 


5 

6i 


51 

61 

2 

61 

61 

41 

6tt 

6 

61 

61 

51 

6} 

8 

61 

61 

8 

51 

10 

61 

61 

71 

5H 

5 

61 

61 

51 

61 

2 

61 

61 

41 

6H 

6 

61 

61 

51 

61 

8 

61 

61 

8 

51 

10 

6} 

61 

71 

5fi 

5 

61 

61 

5} 

61 

2 

61 


41 

6H 

6 

61 


51 


8 

61 


8 


10 



71 



5 

(The items in columns (3) and (5) have been centered by means of a 
moving average of 2 items.') 

1 The decrease in the magnitude of the fluctuations is not regular, however, 
but cyclical. 
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the data with the cycles completely removed. When the 
period of the average is not equal to the period of the cycle, 
or to a multiple of that period, the cycle is not removed, 
as is apparent from the figures in columns (4) and (5). 

The conclusions suggested above hold when the cyclical 
fluctuations take place about any straight line. In Table 61 
the foregoing data have been employed but with a constant 
increment of 3. This is equivalent to superimposing the 
same cycles upon a line with a slope of + 3. 

Table 61 

Illustrating the Application of Moving Averages to a Series with 

Linear Trend, 


(1) 

(2) 

(3) 

(4) 

(5) 

Cyclical 

data 

o 

Moving average 
of 5 items 

Moving average 
of 10 items 
(centered) 

Moving average 
of 3 items 

Moving average 
of 8 items 
(centered) 

9 



81 


14 

12i 


14 


19- 

15i 


16f 


17 

is* 


17* 

m 

17 

21J 

211 

191 

21H 

24 

241 

241 

231 

241 

29 

271 

271 

29 

26* 

34 

301 

301 

31} 

29H 

32- 

331 

331 

321 

33* 

32 

361 

361 

341 

36H 

39 

391 

391 

381 

391 

44 

421 

421 

44 

411 

49 

451 

451 

461 

44U 

47 

481 

481 

471 

481 

47 

5H 


491 

51H 

54 

541 


531 


59 

571 


59 


64 



61} 


62 






(The items in columns (3) and (5) have been centered by means of a 
moving average of 2 items.) 


The trend values, with the effect of the cycles completely 
removed, are secured by taking moving averages equal in 
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period to the cycle or to a multiple of that period. The 
cycle persists, with the same period but with diminished 
amplitude, when the average is based upon a period not 
equal to that of the cycle, as is clear from the figures in 
columns (4) and (5). 

When these ideally simple conditions of constant period 
and amplitude do not exist, the moving average becomes 
more ambiguous and its interpretation less simple. If the 
period of the cycle varies, the selection of a period for the 
moving average is more difficult. In general, a period 
equal to or greater than the average length of the cycle 
is to be selected. An average having a shorter period will 
give a line that is marked by pronounced cycles, these 
cycles being reduced as the period covered in the calculation 
of the average increases. 

When the amplitude of the cycle varies, the period being 
constant, a moving average with a period equal to the 
length of the cycle will give a line of trend marked by 
minor cycles. The amplitude of these secondary cycles 
will be a minimum when the period of the average is equal 
to the period of the cycle (or to a multiple of that period). 
When these last two irregularities are combined, and the 
data are characterized by cycles of varying amplitude and 
of varying length, the moving average giving the most 
effective representation of the trend is that which has a 
period equal to the average length of the cycle, or to a 
multiple of that length. 

A new factor enters when the trend departs from line- 
arity. If the underlying trend of a series is concave upward, 
a moving average will always exceed the actual trend value; 
if the reverse is true, and the trend is convex upward, a 
moving average will always be less than the actual trend 
value. 

These conditions are depicted in the following examples. 
The figures in Table 62 give the values secured when a 
cycle of constant period and amplitude, as in col. (3), is 
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superimposed upon a line of trend that is concave upward, 
i.e., increasing at a constantly increasing rate. If the mov- 
ing average could completely eliminate the effects of the 
cycle, the values secured from the average would be equal to 
the average value of the five items in each cycle (6.2) plus 
the values of the function y = x 2 , given in col. (2). 

Table 62 


Illustrating the Application of Moving Averages to a 
Non-Linear Series 
(Increasing rate) 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 


r 2 

Cyclical 

Col. (2) plus 

Moving average 

True trend 
values 
(x 2 + 6 . 2) 

Js 


data 

col. (3) 

of 5 items 

0 

0 

2 

2 



1 

1 

6 

7 



2 

4 

8 

12 

12.2 

10.2 

3 

9 

10 

19 

17.2 

15.2 

4 

16 

5 

21 

24.2 

22.2 

5 

25 

2 

27 

33.2 

31 2 

6 

36 

6 

42 

44.2 

42.2 

7 

49 

8 

57 

57.2 

55.2 

8 

64 

10 

74 

72 2 

70.2 

9 

81 

5 

86 

89.2 

87.2 

10 

100 

2 

102 

108.2 

106.2 

11 

121 

6 

127 

129.2 

127 2 

12 

144 

8 

152 

152 2 

150.2 

13 

169 

10 

179 

177.2 

175.2 

14 

196 

5 

201 

204.2 

202.2 

15 

225 

2 

227 

233.2 

231.2 

16 

256 

6 

262 

264.2 

262.2 

17 

289 

8 

297 

297.2 

295.2 

18 

324 

10 

334 



19 

361 

5 

366 




The values of the moving average are, in this case, in 
excess of the true trend values, a form of distortion that 
will always occur with a series of this type. 

In Table 63 are shown the results of superimposing the 
same cyclical values upon a line of trend that is convex 
upward, i.e., increasing at a constantly decreasing rate. 
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In this case, a perfect method of eliminating the cycles 
would give results equal to the average value of the five 
items (6.2) plus the values of the function y - Vx. 

In this case the moving average values are consistently 
too low. The discrepancy is greatest for the lower values 
of x, as the decrease in the rate of growth is most marked 
for these values. 


Table 63 

Illustrating the Application of Moving Averages to a 
Non-Linear Series 


(Decreasing rate) 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 






True trend 



Cyclical 

Col . (2) plus 

Moving average 

values 

X 

Vx 

data 

col (3) 

of 5 items 

1 t\MV WVW 

(Vi + 6.2) 

0 

0 

2 

2.00 



1 

1.00 

6 

7.00 



2 

1.41 

8 

9.41 

7.428 

7.61 

3 

1.73 

10 

11.73 

7.876 

7.93 

4 

2 00 

5 

7.00 

8.166 

8.20 

5 

2.24 

2 

4.24 

8.414 

8.44 

6 

2.45 

6 

8.45 

8.634 

8.65 

7 

2.65 

8 

10.65 

8.834 

8.85 

8 

2.83 

10 

12.83 

9.018 

9.03 

9 

3.00 

5 

8.00 

9.192 

9.20 

10 

3.16 

2 

5.16 

9.354 

9.36 

11 

3.32 

6 

9.32 

9.510 

9.52 

12 

3.46 

8 

11.46 

9.658 

9.66 

13 

3.61 

10 

13.61 

9.800 

9.81 

14 

3.74 

5 

8.74 

9.936 

9.94 

15 

3.87 

2 

5.87 

10.068 

10.07 

16 

4.00 

6 

10.00 

10.194 

10.20 

17 

4.12 

8 

12.12 

10.318 

10.32 

18 

4.24 

10 

14.24 



19 

4.36 

5 

9.36 




Considerations previously reviewed have indicated that 
a moving average should, in general, be based upon a 
period at least equal to the period of the cycle, and prefer- 
ably equal to some higher multiple of that period when the 
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data are at all irregular. The longer the period covered, 
the greater the stability of the average. But when the 
underlying trend departs materially from the linear form, 
following a curve bending upward or downward, the error 
involved in the use of any moving average increases as 
the period of the average increases. If a moving average 
is used in such a case to measure the trend, the period of 
the average should be the shortest which will serve to 
average out the cycles; equal, that is, to the average length 
of one cycle. 

In practice, however, these various conditions are found 
in complicated combinations. The fact that cycles vary in 
amplitude and length calls for a moving average based 
upon a fairly long period. The fact that the trend of the 
data is usually non-linear calls for a short period average 
to lessen the upward or downward distortion. A considera- 
tion of some importance in practical work is that a moving 
average can never be brought up to date. The lag is less, 
of course, the shorter the period covered by the average. 
The selection of a period in a given case must rest upon a 
study of the actual data with these various considerations 
in mind. 

It has been assumed in the preceding discussion that the 
purpose of the moving average is the representation of 
secular trend. The moving average may be used, also, in 
smoothing data for the purpose of eliminating random 
fluctuations. For this purpose a moving average based 
upon a period shorter than the average length of the cycle 
should be selected. 

We may return now to the problem relating to New York 
bank clearings. A study of the lines marked out by the 
different moving averages in Fig. 54 reveals significant 
differences between them. The three-year average follows 
the graph of the original data most closely, as would be 
expected. The nine-year average marks out the smoothest 
line of trend, but, on the other hand, departs most widely 
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from the data. This is particularly noticeable from 1893 to 
1898, from 1911 to 1915, from 1921 to 1926, and from 1927 
to 1931. It is due to the pronounced changes in the rate 
of growth of the series during these periods. Except for 
these distortions the general trend seems to be most accu- 
rately represented by the nine-year average. 

In determining the relative merits of the different moving 
averages we are aided by a knowledge of the course of 
business during the period covered. The volume of New 
York bank clearings is a sensitive index of general business 
conditions, responding immediately to changes in specula- 
tive and industrial activity. Major and minor business 
cycles are reflected in this series. Knowing the number of 
cycles through which business has passed during the period 
1875-1936, we may determine which of the moving averages 
serves best as a standard from which to measure cyclical 
deviations. In this case we are practically working back- 
ward from a known result, a method not always available. 

If we take as a starting point in each cycle the year in 
which revival began, after recession, the following cycles 
in general business activity may be distinguished: 1 


1871-1879 

1908-1912 

1879-1885 

1912-1914 

188.5-1888 

1915-1919 

1888-1891 

1919-1921 

1891-1894 

1921-1924 

1894-1897 

1924-1927 

1897-1900 

1927-1933 

1901-1904 

1933- 

1904-1908 



The cycles marked out by the three-year moving average 
are too numerous to enumerate. In fact, the deviations 
from this average are primarily accidental and minor 

1 These dates are based upon the chronology of American business cycles 
developed by Wesley C. Mitchell; cf. “Production during the American Busi- 
ness Cycle of 1927-1933,” by Wesley C. Mitchell and Arthur F. Burns, Bul- 
letin 61, National Bureau of Economic Research, November 9, 1936. It should 
be noted that the chronology is based upon monthly data, whereas the Clearing 
House data cited in the text are annual figures. 
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fluctuations and should not be classed as cycles. Deviations 
from the five-, seven-, and nine-year averages mark out the 


following cycles: 
Cycles of deviation? 

Cycles of deviations 

Cycles of deviations 

from five-year 

from seven-year 

from nine-year 

moving average 

moving averages 

moving averages 

1879- J 885 

1879-1885 

1879-1885 

1885-1888 

1885-1888 

1885-1888 

1888-1891 

1888-1894 

1888-1897 

1891-1897 

1894-1900 

1897-1900 

1897-1900 

1900-1904 

1900-1904 

1900-1904 

1904-1908 

1904-1908 

1904-1908 

1908-1911 

1908-1915 

1908-1911 

1911-1915 

1915-1923 

1911-1915 

1915-1918 

1923- 

1915-1918 

1918-1923 


1918-1924 

1923-1927 


1924-1927 

1927-1932 


1927-1932 

1932- 


1932- 




Some of the differences between the series of cycles thus 
determined and the reference cycles distinguished by 
Mitchell are doubtless due to the distinctive behavior of 
New York clearings. Other differences reflect the peculiari- 
ties of moving averages. Deviations from the five-year 
averages between 1879 and 1927 show one more cycle than 
we find in the series based on seven-year averages, four 
more cycles than are shown by the nine-year averages. 
And yet the deviations from five-year averages fail to show 
the cycles of 1894-1897 and of 1921-1924. The nine-year 
averages reveal only eight cycles between 1879 and 1927, 
as against Mitchell’s fourteen reference cycles. Mitchell 
was working, of course, with monthly data which are 
more sensitive than annual data to cyclical forces. More- 
over, he was dealing with relatively short movements, some 
of which appear as only minor fluctuations in general business 
activity. 

If interest attaches to the shorter swings of business, 
to cycles with average durations of three or four years, 
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a moving average of relatively short period should be used. 
A five-year average is appropriate. Averages of longer 
period define trend movements more faithfully, but may 
fail to reveal fluctuations properly classified as business 
cycles. We should refer, however, to recent attempts to 
establish the reality of long cycles, of nine, eleven, or as 
many as thirty years in average duration. In the study 
of such cycles moving averages of corresponding periods 
would be employed. 

In general, the moving average has the prime advantage 
of flexibility. The representation of secular trend by mathe- 
matical curves frequently involves the breaking up of a 
period into two or three subdivisions, and the fitting of 
separate curves to each. This results from changing condi- 
tions and sharply changing rates of growth or decline. 
Where such changes occur the moving average has the 
merit of flexible adaptation to the new conditions and is 
often a more effective measure of secular trend than curves 
fitted with great labor. 

Simple and weighted moving averages, in varying com- 
binations, have wide uses in the analysis of economic time 
series. An illuminating discussion of these uses, and of the 
procedures appropriate to different purposes, is to be found 
in The Smoothing of Time Series, by Frederick R. Macaulay . 1 

Representation of Secular Trend by Mathematical 

Curves 

For many types of data the secular trend may be repre- 
sented by a mathematical curve rather than by a line 
based upon a moving average. Thus, if the growth (or 
decline) is by constant absolute increments (or decrements) 
a straight line will serve as an exact representation of the 
trend. Or the growth may be by constant percentages, 
as in the case of capital increase, when a principal sum 
increases in accordance with the compound interest law. 

1 National Bureau of Economic Research, New York- 1931. 



MATHEMATICAL CURVES 245 

A curve of a definite mathematical form furnishes the best 
representation of this trend. In many series of economic 
statistics the data seem to conform to definite laws of 
growth, or decline, and where this is the case the task of 
analysis, interpretation, and projection is materially assisted 
by securing a mathematical expression for the underlying 
law. In practically all cases, of course, there are departures 
from this law, deviations above and below the line of secular 
trend. These deviations, however, do not destroy the value 
of an equation that describes the underlying law of develop- 
ment. 

There is one fundamental difference between the moving 
average as a measure of trend and such mathematical 
curves. The former implies no definite “law” to which 
the data are assumed to conform. It is based upon the 
data as given; if the general trend changes, the moving 
average follows the new trend. It is a flexible measure of 
trend, adapting itself to changing conditions, purporting 
to be nothing more than an empirical approximation to 
the drift of the series. Mathematical curves fitted to eco- 
nomic series are, in fact, nothing more than empirical 
approximations also, but in a somewhat different sense. 
They assume a “law” of change underlying the variations, 
accidental and otherwise, which show upon the surface of 
the data. It is an empirical law which is assumed, it is 
true, but nevertheless there is postulated a uniform and 
consistent trend capable of mathematical expression. If 
such an assumption is to have any validity it is essential 
that the period during which the law is supposed to hold 
be homogeneous, that there be no material changes in 
the conditions affecting the series being studied. Thus 
an equation is secured for the trend of gold production, 
let us say. If a radical change should take place in methods 
of extraction the trend of gold production would change 
materially and the former equation would no longer apply. 
Data covering the period before and after such a change 
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would not be homogeneous, and a single equation for the 
trend during the whole period should not be secured. 

In the practical approach to a problem involving the 
determination of secular trend the first task is the selection 
of the appropriate type of curve. This is perhaps the most 
difficult part of the work; certainly it is the part in which 
the element of personal judgment enters most directly. 
For there is no objective rule to follow, no fixed standard 
by which the most appropriate curve may be selected. 
Something more will be said on this subject after the 
characteristics of the chief types of curves and the methods 
of fitting them have been described. For the present it 
may be assumed that a curve similar to one of the types 
described in Chapter II, or to a related form, has been 
selected, and that we face the practical task of fitting it to 
the data. 

FITTING A STRAIGHT LINE; THE METHOD OF LEAST SQUARES 

If the data, when plotted, show a trend that can best 
be represented by a straight line the task of fitting is 
merely the determination of the constants in an equation 
of the form y = a + bx. The values of a and b which 
will give a line following most closely the trend of the 
data are to be obtained. A simple illustration may serve 
to demonstrate the various methods which may be employed. 
Nine points (1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 
9, 11) are plotted in Fig. 55. Our problem is the fitting of 
a straight line to these points. 

By inspection approximate values of a and 6 may be 
determined. A thread may be stretched through the 
points in such a direction that it seems to follow the trend 
as closely as possible. The slope of the line thus laid out 
may be measured, the ^-intercept determined, and the 
desired equation thus approximated. Obviously this is a 
loose and uncertain method, and the results obtained by 
different individuals may be expected to vary rather widely. 



THE METHOD OF LEAST SQUARES 247 


There is one and only one straight line that fits the plotted 
data most accurately. The constants for this line of best 
fit may be determined by the method of least squares. 

The theory upon which the method of least- squares is 
based need not be detailed at length here. The argument 
may be briefly presented: A number of observation values 
of a certain quantity are found, and it is desired to obtain 
the most probable value of the quantity which is being 
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Fia. 55. — Illustrating the Fitting of a Straight Line to Nine Points 

measured. It is capable of demonstration that the most 
probable value of the quantity is that value for which the 
sum of the squares of the residuals is a minimum. (The 
“residual” is a term for the difference between a given 
estimated value and an observation value.) This is true 
of the arithmetic mean of the observation values. Thus, 
if a given distance be measured by a number of individuals, 
with varying results, the most probable value is the arith- 
metic mean of the different measurements. The process 
of computing the mean involves the following steps, which 
are enumerated for the purpose of simplifying the later 





248 MEASUREMENT OF TREND 

explanation. We seek a result, a statement of the most 
probable value of the distance being measured, which will 
take the form: 

M = (a constant). 

Let us say we have three approximations to this value : 

M = 5,672 feet 
M = 5,671 feet 
M = 5,676 feet 
adding, 3 M = 17,019 feet. 

Since there is but one unknown, M, it may be derived 
directly from this equation, and we have 

M = 5,673 feet. 

This is the value for which the sum of the squares of the 
deviations is a minimum. 

A similar problem arises when the relation between two 
variables is being measured. Our goal in this case is the 
equation that correctly describes this relationship. We 
have secured, however, varying results which do not agree 
precisely as to the constants in the equation of relationship. 
In other words, our plotted points do not all lie on the 
same line. What are the most probable values of the con- 
stants in the required equation? The answer is analogous 
to that given when a single quantity was being measured. 
We seek the constants which, when the resulting equation 
is plotted, will give a line from which the deviations of 
the separate points, when squared and totaled, will be a 
minimum . Assuming that each pair of measurements gives 
an approximation to the true relationship between the 
variables, we wish to find the most probable relationship, 
and this is given by the line for which the sum of the squared 
deviations is a minimum. 1 

We have, in the present example, nine pairs of values for 
x and y. Substituting these values in the generalized form 

1 Cf. Appendix A for a more detailed discussion of the method of least squares, 
together with a description of certain checks upon the calculations. 
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of the linear equation, y = a + bx, we secure the following 
observation equations: 

3 = a + 16 

4 = a + 26 

6 = a + 36 

5 = a + 46 
10 = a + 66 

9 = a + 66 

10 = a + 76 
12 — a H - 86 

11 = a + 96. 

Any two of these equations could be solved as simultaneous 
equations, and values of a and b secured. But these values 
would not satisfy the remaining equations. Our problem 
is to combine the nine observation equations so as to secure 
two normal equations, which, when solved simultaneously, 
will give the most probable values of a and b. The first 
of these normal equations is secured by multiplying each of 
the observation equations by the coefficient of the first 
unknown (a) in that equation, and adding the equations 
obtained in this way. Since the coefficient of a in the present 
case is 1 throughout, the nine observation equations are 
unchanged by the process of multiplication. The second 
of the normal equations is secured by multiplying each 
of the observation equations by the coefficient of the second 
unknown (6) in that equation, and adding the equations 
obtained. Thus the first equation is multiplied throughout 
by 1, the second by 2, and so on. The process of securing 
the two normal equations is illustrated in Table 64 on 
page 250. 

The two normal equations are 

70 = 9a + 456 
418 = 45a + 2856. 

It remains to solve these equations for a and b. By multi- 
plying the first equation by 5 and subtracting it from the 

68 

second, a may be eliminated; a value of or 1.133, is 
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Table 64 ' 

Derivation of Normal Equations from Observation Equations 


3= a + lb 

4 = a + 2b 
6 = a + 3 b 

5 = a + 4b 
10 = a + 5b 

9 = a + 6b 

10 = a + 7b 
12 = a + 8b 

11 = a 9b 
70 = 9a + 45b 


3 = la -f- lb 
8 = 2a + 4b 
18 = 3a + 9b 
20 = 4a + 16b 
50 = 5a + 25b 
54 = 6a + 36b 
70 = 7a + 49b 
96 = 8a + 64b 
99 = 9a + 81b 
418 - 45a + 285b 


found for b. Substituting this value in either of the equa- 
tions, a value of 2.111 is secured for a. The equation to 
the best fitting straight line is, therefore, 

y = 2.111 + 1.133a;. 

In the actual application of the method it is not necessary 
to write out and total the equations, as is done above. 
We need only insert the proper values in the two equations, 1 

2 ( 2 /) = na + b2(x) 

2(x?/) = a2(x) + b2(x 2 ). 

The symbols employed have the following meanings: 

2 (y) : the sum of the values of y. 

2(x) : the sum of the values of x. 

2(xy) : the sum of the products of the paired x’s and y’a. 

2(x 2 ) : the sum of the squares of the values of x. 

n: the number of pairs of values; the number of points 
plotted. 

The work of computation is facilitated by a tabular 
arrangement similar to that shown in Table 65. 

The two desired normal equations are secured by sub- 
stituting these five values in the type equations given 
above. It will be noted that the results are identical with 
those obtained from the observation equations. 

1 General rules for the formation of normal equations are given in Ap- 
pendix A. 
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Table 65 

Computation of Values Required in Fitting a Straight Line 


X 

y 

xy 

X 2 


1 

3 

3 

1 

n = 9 

2 

4 

8 

4 

2(x) = 45 

3 

6 

18 

9 

2(1/) = 70 

4 

5 

20 

16 

S(**) = 285 

5 

10 

50 

25 

2{xy) = 418 

6 

9 

54 

36 


7 

10 

70 

49 


8 

12 

96 

64 


9 

11 

99 

81 


45 

70 

418 

285 



When the equation to the best fitting straight line has 
been obtained the values of y corresponding to given values 
of x may be computed and compared with the observed 
values. Table 66 presents the results secured: 

Table 66 


Comparison of Observed and Computed Values of a Variable Quantity 1 


X 

y 

{observed) 

y 

{computed) 

d 

d 1 

xd 

1 

3 

3.2$ 

- 2$ 

.0597 

- 2$ 

2 

4 

4.3$ 

- .3$ 

.1427 

- -7$ 

3 

6 

5.5$ 

+ .4$ 

.2390 

+ 1-4* 

4 

5 

6.6$ 

- 1.6$ 

2.7041 

- 6.5$ 

5 

10 

7.7$ 

+ 2.2$ 

4.9381 

+ 11.1* 

6 

9 

8.9$ 

+ o$ 

.0079 

+ .5* 

7 

10 

10.0$ 

- o$ 

.0020 

- 3$ 

8 

12 

11.1$ 

+ .8$ 

.6760 

+ 6.5$ 

9 

11 

12.3$ 

- 1.3$ 
0.0 

1.7190 

10.4885 

- 11.8 
0.0 


The sum of the deviations of the plotted points from the 
line is zero. The sum of the deviations when each is multi- 
plied by the corresponding value of x is also zero. The 
accuracy of the actual calculations involved in fitting may 

1 The common fractions are retained in certain columns in order that the sum 
of the deviations may be exactly zero. 
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be tested in this way. The sum of the squares of the devia- 
tions, 10.4885, is a minimum. Any change in the value of 
a or 6 would give a line from which the sum of the squared 
deviations would exceed 10 . 4885. 


FITTING A STRAIGHT LINE; SPECIAL CASES 

The simultaneous solution of the two normal equations 
will give, in any case, the most probable values of a and b. 
The processes of calculation may be simplified in certain 
special cases, not infrequently encountered in handling 
economic data. If the x’s are consecutive numbers, as they 
always are when an unbroken time series is plotted, the 
origin may be taken at the median value. When the number 
of observations is odd this will be the middle item, of 
course. The value of 2 (x) will then be zero, and the normal 
equations become 

2 (y) = na 
2 On/) = 62(x 2 ). 


Thus if a time series extends, by years, from 1901 to 1937, 
the origin may be taken at 1919, the value of x corresponding 
to 1918 being — 1, to 1920, + 1, and so on. The solution 
for values of a and b is rendered much easier when the 
data may be disposed in this way. When there is an even 
number of years the same process is possible, time (the 
x-variable) being measured in units of one half year. 

Again, when the values of x are consecutive positive 
numbers starting at zero, the values of 2(x) and of 2(x 2 ) 
may be easily determined. The sum of the first n natural 

/ n.('n. f-- 1 j 

numbers is equal to ^ • Thus the sum of the numbers 

5(5 4- 1) 

from 1 to 5 is — ~> or 15. This term may replace 2(x) 

in the normal equations. Similarly, the sum of the squares 

2|j3 | j ^ 

of the first n natural numbers is equal to ^ 

Thus the sum of the squares of the numbers from 1 to 5 
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is equal to — - = 55. This expression may replace 

2(r 2 ) in the normal equations, and we have 

Z(y) = na + t (" ( \ +1) ). 

- «( 2 ^) + t ( 2 "‘ + 3 6 " , + ” ). 

It is sometimes easier to work from equations in this form 
than in the form first given. The data for time series may 
be handled in this way, the years being numbered consecu- 
tively, beginning with 1. 

PITTING A CURVE OF THE POWER SERIES 

The discussion above has been confined to the case of 
linear trend. Such a type of curve frequently gives an 
excellent fit, but in many cases it fails accurately to fit the 
data. This difficulty is sometimes overcome in practice 
by breaking a series into segments and fitting a separate 
line to the data for each of these periods. Where there is 
an actual break in the series, the period as a whole lacking 
homogeneity, this practice may be justified, but when the 
period is essentially homogeneous the whole concept of 
secular trend is violated by this process of subdividing 
and fitting separate lines. In many cases where a straight 
line will not fit, a curve of the power series may represent 
the trend accurately. The general process of fitting such a 
curve may be briefly described. 

The generalized form of the equation of the type desired 
is y = a + bx + cx 1 + dx 3 + ... . An equation of this 
form does not, of course, represent a curve of the parabolic 
type, but in ordinary usage that term is applied to the 
potential series. If carried to the second power of £ it is 
called a second degree parabola; if to the third power, a third 
degree parabola, etc. For ordinary purposes such a curve 
should not. be carried beyond the second or third power of x. 
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If carried to the second power there are three unknowns,, and 
three normal equations must be solved simultaneously in 
securing the required values. 

The procedure is similar to that outlined for the linear 
case. Each observation equation is multiplied by the 
coefficient of the first unknown in that equation, and the 
resulting equations are totaled to give the first normal 
equation. The process is repeated for the two other un- 
knowns, and the three normal equations thus obtained are 
solved for a, b, and c. The results are the most probable 
values of these three constants. The following are the 
general forms which the three normal equations take: 

2 (y) = na + 62(x) + c2(:c 2 ). 

2 (xy) = o2(z) + 62(x 2 ) + c2(x 3 ). 

2 {x*y) = a2(x 2 ) + 62(x 3 ) + c2(x 4 ). 

As an example of the process, the calculations involved 
in fitting a second degree parabola to the points 1, 2; 2, 6; 
3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 8, 10; 9, 9 may be outlined. 
It is of the greatest practical importance in curve fitting, 
as in all extensive calculations, that the work be laid out 
and carried on in a definite and systematic fashion, with 
each step definitely related to the preceding and succeeding 
operations. Checks should be introduced wherever possible, 
as mathematical errors creep into even the most careful 
work. A tabular arrangement is generally advisable, each 
operation being revealed and each set of results clearly 
presented. The data in the present problem may be ar- 
ranged as in Table 67. 

When the x's are consecutive integers beginning with 1, 
as in the present case, the values of 2(x), 2(x a ), 2(x 3 ), 
and 2(x 4 ) may be secured from prepared tables. 1 

1 Cf, Table XXVIII, Pearson, Tables for Statisticians and Biometricians . 

Cambridge University Press; Tables D and E, Mills and Davenport, Manual 

of Problems and Tables in Statistics , New York, Henry Holt and Co. Values 

to the third power are given in Appendix Table IX of the present volume. 
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Table: 67 

Computation of Values Required in Fitting a Second Degree Parabola 


X 

y 

xy 

X 2 

x*y 


1 

2 

2 

1 

2 

n = 9 

2 

6 

12 

4 

24 

2(x) = 45 

3 

7 

21 

9 

63 

2(x 2 ) = 285 

4 

8 

32 

16 

128 

2(x 3 ) = 2,025 

5 

10 

50 

25 

250 

2(x 4 ) =15,333 

6 

11 

66 

36 

396 

2(2/) = 74 

7 

11 

77 

49 

539 

2(x2/) = 421 

8 

10 

80 

64 

640 

2(s 2 2/) = 2,771 

9 

9 

81 

81 

729 


45 

74 

421 

285 

2,771 



Substituting these values in the equations given above, 
the following normal equations are secured: 

74 = 9a + 456 + 285c. 

421 = 45a + 2856 + 2,025c. 

2,771 = 285a + 2,0256 + 15,333c. 

When these equations are solved simultaneously the 
following values are secured for the three constants: 

a = - .929. 

6 = +3.523. 
c « - .267. 

The equation of the desired curve is 

y = - .929 + 3.523z - .267x 2 . 

This curve and the nine given points are plotted in Fig. 56 
on page 256. 

If the values of x are consecutive, as in the present 
example, the work of computation is lightened if the mid- 
value is taken as origin. In this case 2(x) and 2(a; J ) are 
equal to zero, and the normal equations become 

2 y — na + c2(x 2 ). 

2 (xy) = 62 (x 2 ). 

2 (x*i/) = a2(x 2 ) + c2(x 4 ). 
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When a third degree parabola of the form y = a + bx 
+ cx i + dx 3 is to be fitted to data, four constants must be 
determined, and four normal equations are necessary. 
These are of the following form: 

2 ( 2 ,) = na + 62(x) + c2(x 2 ) + d2(x s ). 

2 (xy) = a2(x) + 62(x 2 ) + c2(x s ) + d2(x 4 ). 

2(x 2 2/) = a2(x 2 ) + 62 (x 3 ) + c2(x 4 ) + d2(x s ). 

2(x 3 2,) = a2(x 3 ) + 62 (x 4 ) + c2(x s ) + d 2(x»). 


The solution for four or more constants involves a con- 
siderable amount of arithmetical calculation, and there is 
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Points 


some question as to the advisability of representing secular 
trend by equations of this type. With a sufficient number 
of constants a curve may be fitted which will follow every 
variation in the data, but such a curve could hardly be taken 
to represent the long time trend. 1 Minor departures from 

1 Regarding the employment of potential series of the type indicated for 
representing empirical curves, Steinmetz states that their use is justified: 

1. If the successive coefficients a, b, c . . . decrease in value so rapidly that 
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a simple and uniform trend, linear or otherwise, are to be 
expected with economic data, but, if a real trend exists, 
extreme departures from a fairly simple form are rare. If 
such departures are due to pronounced changes in conditions 
no single line of trend is likely to be satisfactory, and it is 
advisable to break the period into parts, with a separate 
line of trend for each part. “Empirical curves,” says 
Steinmetz, “can be represented by a single equation only 
when the physical conditions remain constant within the 
range of the observations.” Though this statement relates 
to the fitting of curves to data from the physical sciences, 
it applies equally well to economic data. 

Determination of the Secular Trend of a 
Business Series 

fitting a straight line 

The procedure of fitting certain types of curves to simple 
data has been illustrated in the preceding sections. Before 
proceeding to a discussion of slightly different forms, it 
will be helpful to examine concrete examples of trend 
determination. We first determine the secular trend of 
a series defining the number of concerns in business in the 
United States, during the period from 1899 to 1914. 1 The 
observations are given in Table 68, together with the values 
required for the fitting of a straight line to the data, and 
the derived trend values. The values of x represent the 
time factor, while the values of y are the corresponding 
numbers of business concerns. Only the entries in columns 

( Footnote 1 continued from page 256.) 

within the range of observation the higher terms become rapidly smaller 
and appear as mere secondary terms. 

2. If the successive coefficients follow a definite law, indicating a convergent 

series which represents some other function, as an exponential, trigono- 
metric, etc. 

3. If all the coefficients are very small, with the exception of a few of them, and 

only the latter ones thus need to be considered. Cf. C. P. Steinmetz 
Engineering Mathematics , New York, McGraw-Hill, 1917, 214-215. 

1 Data compiled by Dun and Bradstreet. 
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(2) to (5), it should be noted, are employed in the fitting 
process. 


Table 68 


Number of Concerns in Business in the United States, 1899-1914 

Computation of values required in fitting line of trend 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

Year 

X 

y 

xy 

X* 

Ve 






Trend values 



No, of concerns 



{linear) of no. 



in business, 



of concerns 



in thousands 



in business , 






in thousands 

1899 

1 

1,148 

1,148 

1 

1,152 

1900 

2 

1,174 

2,348 

4 

1,184 

1901 

3 

1,219 

3,657 

9 

1,217 

1902 

4 

1,253 

5,012 

16 

1,250 

1903 

5 

1,281 

6,405 

25 

1,283 

1904 

6 

1,320 

7,920 

36 

1,316 

1905 

7 

1,357 

9,499 

49 

1,349 

1906 

8 

1,393 

11,144 

64 

1,382 

1907 

9 

1,418 

12,762 

81 

1,415 

1908 

10 

1,448 

14,480 

100 

1,448 

1909 

11 

1,486 

16,346 

121 

1,481 

1910 

12 

1,515 

18,180 

144 

1,513 

1911 

13 

1,525 

19,825 

169 

1,546 

1912 

14 

1,564 

21,896 

196 

1,579 

1913 

15 

1,617 

24,255 

225 

1,612 

1914 

16 

1,655 

26,480 

256 

1,645 

Totals 

136 

22,373 

201,357 

1,496 



N 

= 16 

2(1/) 

= 22,373 



2(z) 

= 136 

2 (xy) 

= 201,357 



2(* 2 ) 

= 1,496 





The equations to be solved in determining the required 
constants are of the form 


2 ( 2 ,) = Na + 62 (x) 

2 {xy) = a2(x) + 62(x l ). 

Inserting the given values in the formulas, we have 

22,373 = 16a + 1366 
201,357 = 136a + 1,4966 
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from which 


a = 1,118.66 
b = 32.90. 

The equation to the line of best fit is therefore 
y = 1,118.65 + 32.90* 
with origin at 1898. 

The trend values derived from this equation appear in 
column (6) of Table 68. The original data and line of 
trend are plotted in Fig. 57. The fit for the period covered 



Fig. 57. — Number of Concerns in Business in the United States, 
1899-1914, with Line of Trend 


is good. The number of concerns in business in the United 
States during the sixteen years before the World War is 
well defined by the straight line we have secured. 
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FITTING A POWER CURVE OF THE SECOND DEGREE 

The record of commercial failures in the United States 
over the last forty years provides an example of a series 
following a definitely non-linear trend. Data for the period 
1897-1933 are presented in Table 69, with accompanying 
computations. 


Table 69 


Commercial Failures in the United States, 1897-1933 1 
Computation of values required in fitting line of trend 


(1) 

(2) 

(3) 

(4) 

(5) 

Year 

X 

y 

(No. of 
failures) 

xy 

x*y 

1897 

- 18 

13,351 

- 240,318 

4,325,724 

1898 

- 17 

12,186 

- 207,162 

- 149,392 

3,521,754 

1899 

- 16 

9,337 

2,390,272 

1900 

- 15 

10,774 

- 161,610 

2,424,150 

1901 

- 14 

11,002 

- 154,028 

- 150,995 

2,156,392 

1902 

- 13 

' 11,615 

1,962,935 

1903 

- 12 

12,069 

- 144,828 

1,737,936 

1904 

- 11 

12,199 

- 134,189 

1,476,079 

1905 

- 10 

11,520 

- 115,200 

1,152,000 

1906 

- 9 

10,682 

- 96,138 

865,242 

1907 

- 8 

11,725 

- 93,800 

750,400 

1908 

- 7 

15,690 

- 109,830 

768,810 

1909 

- 6 

12,924 

- 77,544 

465,264 

1910 

- 5 

12,652 

- 63,260 

316,300 

1911 

- 4 

13,441 

15,452 

- 53,764 

215,056 

1912 

- 3 

- 46,356 

139,068 

1913 

- 2 

16,037 

- 32,074 

64,148 

1914 

- 1 

18,280 

- 18,280 

18,280 

1915 

0 

22,156 

0 

0 

1916 

i 

JL 

16,993 

16,993 

16,993 

1917 

2 

13,855 

27,710 

55,420 

1918 

3 

9,982 

29,946 

25,804 

89,838 

1919 

4 

6,451 

103,216 

1920 

5 

8,881 

44,405 

117,912 

222,025 

1921 

6 

19,652 

707,472 

1922 

7 

23,676 

165,732 

1,160,124 

1,197,952 

1923 

8 

18,718 

149,744 

1924 9 

1 Dun and Bradstreet. 

20,615 

185,535 

1,669,815 
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Table 69 — Continued, 

Commercial Failures in the United States, 1897-1933 


(1) 

(2) 

(3) 

(4) 

(5) 

Year 

X 

y 

xy 

x*y 

1925 

10 

21,214 

212,140 

2,121,400 

1926 

11 

21,773 

239,503 

2,634,533 

1927 

12 

23,146 

277,752 

3,333,024 

1928 

13 

23,842 

309,946 

4,029,298 

1929 

14 

22,909 

320,726 

4,490,164 

1930 

15 

26,355 

395,325 

5,929,875 

1931 

16 

28,285 

452,560 

7,240,960 

1932 

17 

31,822 

540,974 

9,196,558 

1933 

18 

19,626 

353,268 

6,358,824 

Totals 


610,887 

+ 1,817,207 

75,307,301 


N = 37 

Z(x 4 ) = 864,690 

2(x) = 0 

2(j/) = 610,887 

Z(z 2 ) = 4,218 

2(*2 /) = 1,817,207 

2(s 3 ) = 0 

2(1*2/) = 75,307,301 

The origin is taken 

at the middle year to facilitate the 


calculations. The values of 2(x 2 ) and 2(x 4 ) may be secured 


from prepared tables, or from the formulas cited on page 254. 

The normal equations to be solved in fitting a second 
degree parabola, with the origin at the middle year of the 
period covered, are of the form 

2 (y) = Na + c2(x 2 ) 

2 (xy) = 62(x 2 ) 

’Z(x 1 y) = aS(x 2 ) + c2(x 4 ). 

Inserting the appropriate values, we have 

610,887 = 37a + 4,218c 
1,817,207 = 4,2186 
75,307,301 = 4,218a + 864,690c. 

Solving for the constants, 


a = 14,827.6 
6 = 439.82 
c = 14.762. 
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The required equation is 

y = 14,827.6 + 430.82a: + 14.762** 
with origin at 1915. 

The original observations and the line of secular trend 



Fig. 58. — Commercial Failures in the United States, 1897-1933, with 

Line of Trend 


are plotted in Figure 58. Observations, trend values and 
deviations from trend are given in Table 70. 

Commercial failures reflect the major cycles in American 
business, but with movements that reverse those of most 
economic series. Failures are numerous in times of depres- 
sion, fewer in prosperity. The reader who will compare the 
deviations from trend shown in Table 70 with the dates 
of reference cycles given on an earlier page will note the 
general agreement. The sharp fall in business failures 
from 1932 to 1933 reflected, of course, the special conditions 
prevailing in the latter year. 
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Table 70 


Commercial Failures in the United States, 1897-1933 
Actual Values, Trend Values, and Deviations from Trend 



Number of 

Trend value, 

Deviation of 

Year 

commercial 

second degree 

actual from trend 


failures 

parabola 

value 

1897 

13,351 

11,855.73 

+ 1,495.27 

1898 

12,186 

11,769.88 

+ 416.12 

1899 

9,337 

11,713.55 

- 2,376.55 

1900 

10,774 

11,686.75 

- 912.75 

1901 

11,002 

11,689.47 

- 687.47 

1902 

11,615 

11,721.72 

- 106.72 

1903 

12,069 

11,783.49 

+ 285.51 

1904 

12,199 

11,874.78 

+ 324.22 

1905 

11,520 

11,995.60 

- 475.60 

1906 

10,682 

12,145.94 

- 1,463.94 

1907 

11,725 

12,325.81 

- 600.81 

1908 

15,690 

12,535.20 

+ 3,154.80 

1909 

12,924 

12,774.11 

+ 159.79 

1910 

12,652 

13,042.55 

- 390.55 

1911 

13,441 

13,340.51 

+ 100.49 

1912 

15,452 

13,668.00 

+ 1,784.00 

1913 

16,037 

14,025.01 

+ 2,011.99 

1914 

18,280 

14,411.54 

+ 3,868.47 

1915 

22,156 

14,827.60 

+ 7,328.40 

1916 

16,993 

15,273.18 

+ 1,719.82 

1917 

13,855 

15,748.29 

- 1,893.29 

1918 

9,982 

16,252.92 

- 6,270.92 

1919 

6,451 

16,787.07 

- 10,336.07 

1920 

8,881 

17,350.75 

- 8,469.75 

1921 

19,652 

17,943.95 

+ 1,708.05 

1922 

23,676 

18,566.68 

+ 5,109.32 

1923 

18,718 

19,218.93 

- 500.93 

1924 

20,615 

19,900.70 

+ 714.30 

1925 

21,214 

20,612.00 

+ 602.00 

1926 

21,773 

21,352.82 

+ 420.18 

1927 

23,146 

22,123.17 

+ 1,022.83 

1928 

23,842 

22,923.04 

+ 918.96 

1929 

22,909 

23,752.43 

- 843.43 

1930 

26,355 

24,611.35 

+ 1,743.65 

1931 

28,285 

25,499.79 

+ 2,785.21 

1932 

31,822 

26,417.76 

+ 5,404.24 

1933 

19,626 

27,365.25 

- 7,739.25 
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The second degree curve employed to define the trend 
of commercial failures does so with reasonable accuracy 
over the period here covered. Extrapolation beyond those 
limits would be hazardous. Indeed, the changed conditions 
under which banking and many other types of business 
were conducted after 1933 may well break the continuity 
of the series, and generate a new long-term trend. 

T his Use of Logarithms in Curve Fitting 

The family of curves described above represents a simple 
and very useful type. Perhaps of even greater general 
utility, in the analysis of time series, are curves of a semi- 
logarithmic type. The advantages of plotting many series 
of data on semi-logarithmic or “ratio” paper were explained 
in an earlier section. A fundamental virtue of this type 
of plotting is that it presents a true picture of relative 
variations, of ratios between magnitudes. Relations of 
this type are ordinarily of primary interest in the analysis 
of economic data, and it is logical that determination of 
trends should proceed on the same basis. 

In doing so, we can make use of a group of curves of the 
same general form as those already described, the one 
difference being that log y takes the place of y throughout. 
That is, the straight line form is log y = a + bx, while the 
general form for the potential series is log y — a + bx + ex 3 
+ dx 3 + ... . The curves secured may be constructed on 
arithmetic paper, plotting the natural ar’s and the logarithms 
of the y’s, or natural values of both x’s and y’s may be 
plotted on semi-logarithmic paper, the logarithmic scale 
extending along the y- axis. The latter is the simpler 
method. 

To illustrate the procedure, the steps involved in fitting 
a curve of the type log y — a + bx will be shown. The 
trend of petroleum production in the United States from 
1922 to 1929 is to be determined. The values needed in 
the normal equations are derived from Table 71. 
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Table 71 


Petroleum Production in the United States, 1923-1929 
(Computation of values required in fitting line of trend) 


Year 

X 

y 

log y 

x • log y 

1922 

1 

557.5 

2.74624 

2.74624 

1923 

2 

732.4 

2.86475 

5.72950 

1924 

3 

713.9 

2.85364 

8.56092 

1925 

4 

763.7 

2.88292 

11.53168 

192G 

5 

770.9 

2.88700 

14.43500 

1927 

6 

901.1 

2.95477 

17.72862 

1928 

7 

901.5 

2 9.5497 

20.68479 

1929 

8 

1,007.3 

3.00316 

24.02528 




23.14745 

105.44203 


N = 8 


2 (log y) = 23.14745 



2(x) = 36 


2(x ■ log y) = 105.44203 



2 (a; 2 ) = 204 

The two normal equations to be solved are of the form 

2 (log y ) = Na + 62a; 

2(a; • log y ) = a2a; + 62a; 2 . 

Substituting the given values we have 

23.14745 = 8a + 366 
105.44203 = 36a + 2046. 

Solving for the constants, 

a = 2.75645 
6 = .03044. 

The equation to the desired curve is, therefore, 
logy = 2.75645 + .03044® 
with origin at 1921. 

In fitting this curve by the method of least squares, as 
is done above, we satisfy the condition that the sum of 
the squares of the logarithmic deviations shall be a minimum. 
That is, the deviations to which this condition relates are 
the differences between the logarithms of the observed 
values and the logarithms of the corresponding trend values. 
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This curve, it should be noted, is not the same as that 
from which the sum of the squares of the arithmetic (natural) 
deviations is a minimum. 

The substitution in the above equation of the value o' 
x representing any given year will enable the logarithm 
of the trend or normal value to be calculated. The trend 
value in natural numbers may then be determined. In 
Table 72 the normal value for each of the years covered 
is given, together with the percentage relation of actual to 
normal. 


Table 72 


Trend, of Petroleum Production in the United Stales, 1922-1929, 
with Comparison of Actual and Trend Values 
(Straight line trend determined from logarithms of production figures) 


Year 

X 

y ( actual ) 
Production 
(in millions 
of bbls.) 

logy c 

Log of trend 

Vc 

(y } computed ) 
Trend value 
(in millions 
of bbls.) 

Percentage rela- 
tion of actual 
to trend 

1922 

1 

557.5 

2 78689 

612.2 

91.1 

1923 

2 

732.4 

2.81733 

656.6 

111.5 

1924 

3 

713.9 

2.84777 

704.3 

101.4 

1925 

4 

763 7 

2.87821 

755.5 

101 1 

1926 

5 

770.9 

2 90865 

810 3 

95.1 

1927 

6 

901.1 

2 93909 

869.1 

103.7 

1928 

7 

901.5 

2.96953 

932.2 

96.7 

1929 

8 

1,007.3 

2.99997 

999.9 

100.7 


The points representing the actual production, together 
with the line of trend, are plotted in Fig. 59. The graph 
of the derived equation gives a good representation of 
the trend in the present instance. 

An equation of this type, defining a linear trend in the 
logarithms of the dependent variable, has certain dis- 
tinctive advantages. The reader will note that this is the 
logarithmic form of an equation to a compound interest 
curve (an exponential curve). This equation was given 
in Chapter II as 
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y = p(l + r )* 
or 

log y - log p + log (1 + r)x. 

In the example just given we have used the symbol a for 
log p and the symbol b for log (1 + r), but the equations 
are identical. 


Millions 
of Barrels 



Fig. 59. — Production of Petroleum in the United States, 1922-1929, 
with Line Defining Average Rate of Growth 

We may readily change to natural numbers the constants 
in the equation defining the trend of petroleum production 
from 1922 to 1929. We have 

log y = 2.75645 + .03044a: 

where 2.75645 is log p and .03044 is log (1 + r). The 
natural number corresponding to 2.75645 is 570.8; the 
natural number corresponding to .03044 is 1.0726. The 
trend of petroleum production in natural form is, therefore, 
y - 570.8(1.0726)* 

with origin at 1921. 

Subtracting 1 from the constant 1 . 0726 we secure . 0726, 
which is r, the rate of increase of a series growing in accord- 
ance with the compound interest law. (If, on subtracting 
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1, we have a negative value, the growth is negative, of 
course.) This measure indicates that the production of 
crude petroleum increased at an average rate of 7.26 per 
cent a year between 1922 and 1929 (r being multiplied by 
100 to place it on a percentage basis). 

When the trend of a series in time may be defined by a 
straight line on ratio paper, and it is surprising how widely 



Fio. 60. — Production of Crude Petroleum in the United States, 
1918-1936, with Line of Trend 


applicable such a function is, the constant r is a highly 
useful measure. It defines the average annual rate of 
growth or decline of the series. It is, of course, an abstract 
measure and thus has the great merit of permitting com- 
parison of the trends of series relating to widely different 
original units. The rate of growth of population, over a 
given period, may have been 1.4 per cent per year; the 
production of gasoline may have increased at a rate of 
4.5 per cent, the production of automobiles at 4.2 per 
cent, the production of wheat at 1 . 1 per cent, total national 
income at 1 . 6 per cent, total national debt at 3 . 2 per cent. 
The trends of these series are immediately comparable, and 
conclusions concerning the direction and character of a na- 
tion’s development may be drawn. This measure provides a 
valuable device for the study of social and economic change. 1 

1 In any extensive application of this procedure time and labor may be 
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Table 73 

Production of Crude Petroleum in the United States, 1918-1936 
Comparison of Actual and Trend Values 
(Trend values determined from second degree parabola fitted to logarithms of 

production figures) 



y ( actual ) 

y { computed ) 

Percentage 

Year 

Production 

Trend value 

relation of 

{in millions 

{in millions 

actual to 


of bbls.) 

of bbls.) 

trend 

1918 

335.9 

345.0 

97.4 

1919 

378.4 

395.5 

95.7 

1920 

442 9 

449.2 

98.6 

1921 

472 2 

505.2 

93.5 

1922 

557.5 

562 8 

99.1 

1923 

732.4 

620.8 

118.0 

1924 

713.9 

678.2 

105.3 

1925 

763.7 

733.8 

104.1 

1926 

770.9 

786.3 

98.0 

1927 

901.1 

834.3 

108.0 

1928 

901.5 

876.8 

102.8 

1929 

1,007.3 

912.4 

110.4 

1930 

898.0 

940.5 

95.5 

1931 

850.3 

960.0 

88 6 

1932 

785.2 

970.4 

80.9 

1933 

905.7 

971.5 

93.2 

1934 

908.1 

963.2 

94.3 

1935 

996.6 

945.7 

105.4 

193G 

1,098.5 

919.6 

119.5 


By the use of additional terms a function of the type just 
discussed may be modified, when dealing with a series 
marked by non-lincar trends on ratio paper. For example, if 
the course of petroleum production be followed over a longer 
period, as is done in Fig. 60, it is obvious that the trend 
line secured for the period 1922-1929 is inappropriate. The 
addition of a third constant gives an equation of the type 
log y = a + bx + cx 2 . 

( Footnote 1 continued from page 268.) 

saved by utilizing Glover\s mean value table (cf. James W. Glover, Tables oj 
Applied Mathematics , George Wahr, Ann Arbor, Michigan, 1923, 468ff.). 
By the use of this table the compound interest curve may be fitted directly 
to the natural numbers. All necessary computations are simply and quickly 
performed. 
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In fitting this to the data of petroleum production for 
the period 1918-1936, we may follow the exact procedure 
used when y was the dependent variable in a similar equa- 
tion (see page 261), except that log y is used throughout, 
instead of y. For the required equation we have 

logy = 2.921331 + . 023660s - .002107a: 2 

with origin at 1927. This is shown graphically in Fig. 60. 
Actual and trend values, in natural terms, are given in 
Table 73 on page 269. 

Other Curve Types 

The two families of curves described in the preceding 
sections meet most of the needs of the economic statistician. 
The trend in most time series may be described by curves 
of the power series, fitted either to natural numbers or 
to the logarithms of the data (that is, to the logarithms 
of the y values; time, the x- variable, is treated in terms of 
natural numbers in fitting both the above types of curves). 
These classes constitute flexible and widely applicable curve 
forms. 1 Attention may be called to several other curve 
types which have been applied less extensively to time 
series, but with favorable results in particular cases. 

Curves of the ordinary parabolic type (y = ax b ) are not 

1 There are available for fitting highar degree curves of the power series 
methods that lessen the labor involved, particularly if curves of different degree 
are to be fitted to the same data. These methods, which reduce the fitting 
process to a series of simple adding machine operations, are appropriate to 
extended research projects. Their use is not advisable, however, unless work 
involving a considerable number of routine operations is contemplated. It is 
desirable that the student master the basic least squares procedures outlined 
in the preceding pages, utilizing other methods only in case extended computing 
tasks are undertaken. 

For accounts of systematic methods suited to extensive calculations, see 
R. A. Fisher, Statistical Methods for Research Workers , Edinburgh, Oliver 
and Boyd, Sixth edition, 1936, 148-156; Max Sasuly, Trend Analysis of Statis- 
tics: Theory and Technique t Washington, Brookings Institute, 1934. The ap- 
plication of the method of orthogonal polynomials described by Fisher is ad- 
mirably exemplified in James W. Angell, The Behavior of Money t New York, 
McGraw-Hill, 1936, 195-202. 
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generally applicable to economic data in the form of time 
series, as their use involves the treatment of the time 
variable as a geometric series. Such a curve, it will be 
recalled, becomes a straight line on double logarithmic 
paper. Yet if a curve of this form serves accurately to 
describe the trend of a given series, its use is justified, 
empirically. 

Such curves may be fitted most readily by employing 
logarithms and using an equation of the linear type. The 
equation 

y = ax b 

becomes, in logarithmic form, 

log y = log a + 6 log x. 

The two normal equations needed in fitting such a curve 
are of the form 

2 (log y) = n log a + 62 (log x) 

2(log x • log y) = log a2(log x) + 62(log x) 2 . 

By substituting the values computed from the data, these 
equations may be solved for log a and b, just as in fitting 
an ordinary straight line. 1 

The equation to the simple exponential curve may be 
written 

y = or*. 

(The r in this equation is the equivalent of 1 + r, as given 
on p. 267.) This equation may be used to define the trend 
of a series increasing or decreasing in geometric progression. 
It has been observed that the trends of economic series 
frequently depart from such a geometric progression by 
constant magnitudes. By adding this magnitude, in a 
given case, to the original series (or subtracting it), a 

1 A useful table of the sums of the logarithms of the natural numbers from 
1 to 100 is included as an appendix to Medical Biometry and Statistics, by 
Raymond Pearl, Philadelphia, Saunders, 1923. 
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modified series with a clear exponential trend may be 
secured. The trend of the original series may be written 

y — K + ar x 

where K is the constant magnitude by which the series 
departs from a geometric progression. A modified exponential 
curve of this type may give a highly satisfactory representa- 
tion of trend, in certain cases. The method employed in 
fitting such a curve is discussed in Appendix D. 

Some use has been made, in the interpretation of eco- 
nomic statistics, of the Gompertz curve, the equation to 
which was originally developed in the actuarial field. The 
equation is 

y — ab cX . 

Its use in the analysis of economic statistics has been based 
upon the argument that there is a general law of growth 
characteristic of population increase, and that this same 
type of growth is found in industries whose products are a 
direct function of the growth of population. 

A somewhat similar curve of growth, the “logistic,” has 
been employed by Verhulst and more recently by Raymond 
Pearl and Lowell J. Reed in forecasting population growth. 
T his curve has been found to describe the trends of cer- 
tain economic series. Examples of the procedures employed 
in fitting Gompertz and logistic curves are given in Appen- 
dix D. 

The Determination of Monthly Trend Values 

The procedures so far described have dealt with annual 
measurements only. Having fitted a line or curve to annual 
data it is frequently necessary to make a transition to 
monthly units. Problems involving such monthly measure- 
ments are faced in the study of cyclical movements which 
are discussed in the next chapter. 

The constant a in the trend equation defines the trend 
value in the year taken as origin. If the annual data 
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employed in the fitting processes are averages of twelve 
monthly values (e.g., the average price of pig iron in 1937) 
the constant a measures the trend value for a month cen- 
tering at the middle of the year covered by the annual 
figures. If the annual data are aggregates of twelve monthly 
values (e.g., total production of pig iron in 1937) the constant 
a must be divided by 12 to obtain the trend value for the 
month centering at the middle of the year. 

If the trend be linear, the constant b in the equation 
y = a + bx defines the change due to trend over a twelve- 
month period. In interpolating for monthly trend values, 
the increment (or decrement) from month to month (e.g., 

from January to February of the year 1937) is if the 

annual data employed in the fitting process are averages 
of monthly values. The increment from month to month is 

if the annual data are aggregates of monthly values. 

The one further step needed is properly to center the 
monthly trend values. These should, of course, be centered 
at points of time corresponding to those to which the 
actual monthly data relate. In averaging, or aggregating, 
monthly data relating to the middle of each of the twelve 
months in a calendar year we secure a figure centering at 
July 1. The month centering at the middle of the year 
of origin thus centers at July 1. For comparison with actual 
monthly data, we desire trend values centering at July 15, 
August 15, etc. At the beginning, therefore, we must add 
to the trend value for the month centering at the middle 

of the year of origin ^that is, to a or to one-half of the 

month-to-month increment (or decrement) that we have ob- 
tained from 6 of the trend equation. This procedure gives us 
the trend value for the month centering at July 15. This value 
may be compared with the actual value recorded for that 
month. The addition to this of the month-to-month trend 
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increment (or decrement) gives trend values for all following 
months; subtraction gives trend values for all preceding 
months. 1 

The Selection of a Curve to Represent Trend 

Various types of curves which may be fitted to represent 
the trend of economic data over a period of time have been 
described. But which of these many types is to be selected 
in a given case? Which will give the best standard of 
normality for each of the years covered? Several references 
to this problem have been made in the preceding sections, 
but no general principles have been laid down. And, in 
fact, no general principles can be evoked to answer this 
fundamental question. There is no absolute test of goodness 
of fit in such cases. It is largely a matter of personal judg- 
ment as to the type of curve which best represents the 
trend in a given instance, and experience must play a 
dominant part in such judgments. But certain general 
considerations are of assistance in selecting the appropriate 
type of curve. 

1. The first step in the selection of a curve type is the 
plotting of the data. When this has been done, it is fre- 
quently possible by inspection to determine the appropriate 
form. The data may be plotted in four different combina- 
tions, of which the first two are of chief importance in 
dealing with economic material. 

a. Natural x, natural y. (That is, plot the given figures on ordi- 

nary arithmetic paper.) 

b. Natural x, log y. (Plot the x’s on the natural scale, and plot 

the y’s on the logarithmic scale; Le., use semi-logarithmic 

paper.) 

1 If the original monthly data relate to the first or last of the month, rather 
than the middle, a similar correction is needed, but the monthly dates named 
in the text would be different, of course. If the trend equation is non-linear, 
the process of interpolation must be correspondingly modified. For a discus- 
sion of appropriate procedures the reader is referred to any treatise dealing 
with the general principles of interpolation. The Calculus of Observations: 
by Whitaker and Robinson, contains an excellent treatment of this topic. 
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c. Natural y 9 log 2. (Plot on semi-logarithmic paper, with the 

x-scale logarithmic.) 

d. Log y } log 2. (Plot on paper with logarithmic ruling on both 

scales.) 

If in any of these cases a straight line trend is secured, 
a type of equation which plots as a straight line under the 
given conditions (cf. Chapter II) would be selected. If a 
linear equation is not appropriate some other simple type 
may be suggested by the plotted data. In studying such 
graphs for the purpose of selecting a curve to represent 
trend, one should be familiar with the curves representing 
all the simpler equations. 

2. The appropriate curve may be determined by a study 
of the relations between the two variables, x and y . In 
the simpler cases the following relations hold: 1 * 

a. If, when the values of x are arranged in an arithmetic series, 

the corresponding values of y form a geometric series, the 
relation is of the exponential type, described by the equa- 
tion 

y = ab x . 

b . If, when the values of x are arranged in a geometric series, the 

corresponding values of y form a geometric series, the rela- 
tion is of the simple parabolic or hyperbolic type, described 
by the equation 

y = ax b . 

c. If, when the values of x are arranged in an arithmetic series, 

the first differences of the corresponding y 9 s are constant, the 
relation is of the straight line type, described by the equa- 
tion 

y = a + bx. 

The differences between successive y values, when s’s are arranged in an 
arithmetic series, are termed “ first differences” or “first order differences” 
and are represented by the symbol Ay. The differences between successive 
first differences are called “second differences” and are represented by the 

1 It will be recalled that an arithmetic series changes by a constant absolute 

increment, while a geometric series changes by a constant percentage. 



276 


MEASUREMENT OF TREND 


symbol A?y. Differences of higher order are similarly derived. The following 
table illustrates the formation of differences: 


x y 
1 11 

2 40 

3 101 

4 206 

6 367 

6 696 

7 905 

6 1,306 

9 1,811 

10 2,432 


A y 

29 

61 

105 

161 

229 

309 

401 

505 

621 


A*y 

32 

44 

56 

68 

80 

92 

104 

116 


A l y 


12 

12 

12 

12 

12 

12 

12 


d. If, when the values of x are arranged in an arithmetic series, 
the nth differences of the corresponding y’s are constant, the 
relation between the variables is described by an equation of 
the potential series carried to the nth power of x; that is, by an 
equation of the type 


y = a -f bx + cx 2 + dx 3 + . . . + qx n . 

Thus, in the example given above, in which the third differ- 
ences are constant, the relation between x and y would be 
described by an equation of the form 


y = a + bx + cx 2 + dx 3 . 


When one is selecting a curve to use in the analysis of 
economic data, he will rarely, if ever, find these tests to 
be met perfectly. This would happen only when the curve 
chosen passed through all the plotted points. But data 
in a given case will generally approximate some one of 
the conditions described above, and the appropriate type 
of curve will be indicated. 

3. If the study of the original data does not render a 
definite decision possible, several types of curves may be 
fitted to the data and the decision made by comparing 
the results. If the equations to the curves being compared 
contain the same number of constants, a comparison of 
the root-mean-square deviations about the curves furnishes 
a conclusive and valid test of the closeness of the fit within 
the limits of the data. 
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The root-mean-square deviation may be readily computed 
by making use of the following relationship 

2(d 2 ) = 2 (y 2 ) - a2(j/) - 62(xy) - c2(x 2 y) ->■ . . • 

where 2 (d 2 ) is the sum of the squares of the deviations 
about the line of trend. (The derivation of this equation is 
explained in Appendix A, in which a generalized form is 
given.) If the equations do not contain an equal number 
of constants, a test of this sort is invalid and the comparison 
can only be made by inspection. Personal judgment as 
to the curve which represents the trend most accurately 
must be the basis of the decision in such cases. 

It should be remembered that the closeness of fit within 
the limits of the data is not of itself a final criterion. An 
equation could be secured, having a number of constants 
equal to the number of points, which would give a curve 
passing through every point plotted, yet such a curve 
would not necessarily represent the trend. The concept 
of a trend is of a regular, smooth underlying movement, 
from which there are deviations, but which marks the long- 
time tendency of the series. In general, therefore, the curve 
should be of simple form, if it is to be consistent with the 
concept of secular trend. This does not mean, however, 
that a complex trend can be represented by a simple curve 
which fails to conform to the plotted data. 

4. An important question to be answered before the 
form of curve can be selected relates to the limits within 
which the line of trend is to be used. If it is to be used only 
within the limits of the plotted data (i.e., for interpolation ) 
one set of considerations governs the choice of a curve. If 
it is to be projected beyond the limits of the data, used as 
a basis for the determination of normal during a subsequent 
period, other considerations enter. In the former case a 
reasonable fit to the data is the sole requirement; in the 
latter case it is necessary, in addition, that the trend of the 
projection be logical, and consistent with the past record. 
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The fact should be clearly recognized that projection, or 
extrapolation, represents a guess, justified only on the 
assumption that a proper line of trend has been fitted and 
that the same conditions that affected the series in the 
past will prevail in the future. A change in conditions, the 
introduction of new elements, renders the projection invalid. 
When dealing with economic statistics, moreover, it is 
ordinarily impossible to tell, except in retrospect, when a 
change has taken place. Conclusions drawn from the pro- 
jection of a line of trend are always subject to error, therefore. 
In practical statistical work such projections are made, 
and are justified on the ground that the most probable 
course in the future is that which prevailed in the past. 
Projections into the distant future are, of course, subject 
to wider margins of error than short-time projections. 
Lines of trend should be revised from time to time, there- 
fore, as new data become available. 

When a projection is to be made, a simple curve with few 
constants is to be preferred to a more complicated one. A 
third or fourth degree parabola may give an excellent fit 
to the data in a given case, but the projection of such curves 
is inadvisable. It is well to remember, as Perrin has pointed 
out, that a curve suitable for interpolation may not be at 
all adapted to extrapolation. 

The avoidance of distortion of trend lines by abnormal 
conditions in the terminal years of the period studied is 
particularly important when a trend is to be projected. 
Reference is made to this point in the next chapter. 

It seems to be true, in general, that simple curves fitted 
to the logarithms of the y’s give more reliable results when 
projected than curves fitted to the natural numbers. In 
an interesting discussion of this point, Karl G. Karsten 1 
argues that phenomena characterized by a uniform rate 
of change are more likely to maintain their trend than 
phenomena marked by a uniform amount of change. It is 

1 Karl Karsten, Charts and Graphs, New York, Prentice Hall, 1923, 423-425. 
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the semi-logarithmic curves, of course, which best measure 
rates of change. 

5. It is frequently true that no one curve will fit a given 
series during the entire period it is desired to study. This 
may be due to changes in conditions which cause the trend 
to be altered. Thus the trend of wholesale prices was 
downward, in a direction well represented by a straight 
line, from the close of the Civil War to 1896. From 1896 
to the beginning of the World War the trend was upward, 
and could be described by a second degree parabola. From 
1921 to 1929 the trend was also curvilinear, rising to 1925, 
declining thereafter. Similar changes occur in many eco- 
nomic series. By breaking the entire period up into sections, 
appropriate lines of trend may be fitted to the several 
periods thus marked off. This process may be carried to 
a quite illogical extreme, however. The concept of trend 
is of a gradual, long-term change, and the breaking up 
of a series in order to fit a number of trend lines is contrary 
to the whole conception. It may be justified upon occasion 
when a real change in conditions occurs, but in all cases 
the attempt should be made to represent the trend during 
the whole period by a single line. 

Deflation as a Step in Analysis 

Many series of economic data are expressed in monetary 
units, in dollars, pounds, or francs. Such series are subject 
to distortion because of changes in the price level. Thus 
the value of heavy engineering contracts awarded in the 
United States in 1913 amounted to approximately 601 
millions of dollars; in 1929 the value of engineering contracts 
awarded in the same territory amounted to approximately 
3,950 millions of dollars. 1 Was the volume of engineering 
construction in 1929 over six times that in 1913? It was 
not. The value of construction contracts awarded in any 
year depends not only upon the actual volume of construe- 

1 Figures compiled by Engineering News Record . 
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tion but also upon the costs of construction materials and 
labor, and these costs increased substantially from 1913 
to 1929. If we wish to measure the change in the volume 
of construction alone, these values must be corrected for 
the increase in construction costs between 1913 and 1929. 
Such a process is termed deflation. 1 

The selection of an appropriate deflating index is the 
central problem in such cases. For the present purpose we 



Fio. 61. — Comparison of Actual and Deflated Values of Contracts 
Awarded in Engineering Construction, 1913-1936 


may use an index of constructive costs, based upon the 
prices of steel, cement, and lumber, and upon wage rates 
in construction industries, compiled by the Engineering 
News Record. This index shows that construction costs 
in 1929 were approximately 107 per cent higher than in 

1 The term deflation is not inappropriate when correction is being made for 
an advance in prices; it is less suitable when correction is made for a fall in 
prices. The period selected as a standard of reference may be one in which 
prices were relatively high; division by a price or cost index resting on such a 
year as base will raise values relating to other periods. The word deflaiion 
is a convenient one to use for this general process, however. In using it in 
this somewhat technical sense we must understand it to mean correcMon for 
changes in the value of the dollar (as measured by specific indices of prices or 
costs). 
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Table 74 


Actual and Deflated Values of Contracts Awarded in Engineering 
Construction, 1913-1936 



Contracts awarded , 


Deflated value of 


engineering con- 

Index of 

contracts awarded 

Year 

struction (monthly 

construction 

(monthly average , 


average , in thou- 

costs 1 

in thousands of 


sands of dollars) 1 


dollars) 

1913 

50,117 

1.000 

50,117 

1914 

48,574 

.886 

54,824 

1915 

48,740 

.926 

52,635 

1916 

77,778 

1.296 

60,014 

1917 

61,592 

1.812 

33,991 

1918 

82,729 

1.892 

43,726 

1919 

97,991 

1.984 

49,391 

1920 

126,923 

2.513 

50,507 

1921 

99,459 

2.018 

49,286 

1922 

129,716 

1.745 

74,336 

1923 

158,670 

2.141 

74,110 

1924 

166,593 

2.154 

77,341 

1925 

213,287 

2.067 

103,187 

1926 

237,820 

2.080 

114,337 

1927 

271,147 

2.062 

131,497 

1928 

298,215 

2.068 

144,205 

1929 

329,193 

2.070 

159,030 

1930 

264,438 

2.029 

130,329 

1931 

202,693 

1.814 

111,738 

1932 

101,609 

1.570 

64,719 

1933 

89,031 

1.702 

52,310 

1934 

113,383 

1.981 

57,235 

1935 

132,513 

1.952 

67,886 

1936 

198,904 

2.065 

96,322 


1913 (the index is 100 for 1913, 207 for 1929). Dividing 
the 1929 aggregate by 2.07, to correct for the change in 
costs, we secure a deflated total of 1,908 millions of dollars. 
This may be taken to measure the aggregate value of 
engineering contracts awarded in 1929, when the 1913 
dollar is used as a standard of value. (In this process the 
value of money is assumed to be held constant with ref- 


1 Data on contracts awarded have been compiled by the Engineering News 
Record ; the index of construction costs has been computed by the same agency. 
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erence to the year which is the base of the price or cost 
index used as a deflator.) If the deflating index may be 
accepted as an accurate measure of changing costs, the 
deflated series may be assumed to define changes in the 
actual volume of engineering construction. The effects of 
changing prices and wages will have been eliminated. 

The general procedure is illustrated in greater detail in 
Table 74 on page 281. Actual and deflated series are plotted 
in Fig. 61. The degree to which changing monetary values 
distorted the construction series may be readily appreciated 
from the diagram. 

Most value series are affected by price changes, and it is 
generally advisable to correct for this factor before further 
analysis is attempted. Each case presents a new problem, 
for no general deflating index is suitable to all series. The 
index of wholesale prices compiled by the United States 
Bureau of Labor Statistics has been used extensively in 
deflating economic data expressed in dollar values, but this 
index is not at all appropriate in many of the cases in 
which it has been employed. It is absurd, for instance, to 
deflate money wages by an index of wholesale prices. The 
deflating index employed should be a measure of price 
changes as they affect the series being deflated. 

The deflation of a value series is in general a first step 
in the study of that series. The way is then open for further 
analysis by methods explained in the present and succeeding 
chapters. 
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CHAPTER VIII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT 
OF SEASONAL AND CYCLICAL FLUCTUATIONS 

The measurement of secular trend is but one of the 
problems connected with the analysis of a series in time. 
Such series, it has been pointed out, are subject to periodic 
fluctuations, seasonal and cyclical in character, and these 
fluctuations are generally more important in their effects 
upon business than is the long-time trend. Our present 
concern is with methods of isolating such periodic variations. 
The series, in Table 75, which clearly reflects the seasonal 
and cyclical swings of domestic trade in the United States, 
may be used to illustrate methods of measuring these 
movements. 


Table 75 


Average Weekly Freight Car Loadings in the United States, 
1918-1927 1 

(Unit: 1,000 cars) 


Month 

1918 

1919 

1920 

1921 

1922 

1923 

1924 

1925 

1926 

1927 

January 

655 

728 

820 

706 

696 

848 

859 

891 

920 

944 

February 

753 

687 

776 

685 

757 

854 

908 

906 

932 

956 

March 

842 

696 

848 

691 

818 

916 

916 

926 

960 

998 

April 

873 

721 

730 

706 

716 

941 

874 

932 

966 

969 

May 

897 

759 

862 

7(50 

776 

975 

895 

971 

1,018 

1,004 

June 

918 

796 

896 

762 

831 

1,012 

906 

992 

1,052 

1,021 

July 

970 

887 

901 

750 

813 

985 

881 

975 

1,037 

978 

August 

962 

892 

969 

810 

853 

1,042 

969 

1,073 

1,106 

1,073 

September 

956 

960 

967 

842 

925 

1,037 

1,037 

1,074 

1,140 

1,093 

October 

925 

967 

1,005 

932 

978 

1,070 

1,091 

1,107 

1,184 

1,101 

November 

819 

807 

884 

764 

957 

964 

976 

1,024 

1,042 

926 

December 

719 

758 

755 

681 

832 

827 

869 

925 

858 

814 

Average 

857 

805 

868 

757 

829 

956 

932 

983 

1,018 

990 


1 Data from the Annual Bulletin of the American Railway Association and 
the Survey of Current Business. The published figures have been slightly re- 
vised) to take account of calendar variations. 
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For the present purpose the study of seasonal and cyclical 
variations in freight car loadings is limited to the period 
1918-1927. The disturbances of the ensuing period, com- 
bined with changes in railroad operating methods and busi- 
ness practices, materially modified the behavior of this series. 
The demonstration of statistical procedure will be clearer 
if restricted to the relatively homogeneous period here cov- 
ered. 

The Measurement of Seasonal Fluctuations: Moving 

Averages 

Moving averages provide a useful method of defining 
seasonal variations. Since these fluctuations take place 
within a constant period of twelve months, a moving average 
may be used with more confidence than when a cycle of 
varying length is involved. The magnitude of the fluctua- 
tions (the amplitude of the seasonal swings) will not ordinarily 
be constant, hence the line marked out by the moving 
averages will not be completely free of seasonal influences. 
The relation of the actual monthly items to the moving 
averages may be averaged, however, and the indices of 
seasonal variation based upon these averages. 

It is essential, of course, that the moving average, cen- 
tered, fall at the same date as the original figure with 
which it is to be compared. This involves a second process 
of averaging. For example, the weekly averages of freight 
car loadings relate to the middle of each month. The 
average of the twelve monthly items for 1918, when centered, 
falls on July 1st. The average of the items from February, 
1918, through January, 1919, centered, falls on August 1st. 
To secure a figure comparable with the July 16th average, 
these two must be averaged. By this process of computing 
a two-month moving average from the twelve-month aver- 
age, comparability with the original figures is secured. 
Table 76 presents averages obtained in this way for the 
period from July, 1918, to June, 1927. 
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Table 76 

Moving Averages of Freight Car Loadings, 1918-1927 

(12-month moving average, centered, adjusted by 2-month moving average 

centered) 

(Unit: 1,000 cars) 


Month 

1018 

1019 

1920 

1921 

1922 

1923 

1924 

1925 

1926 

1927 

Jan. 


808.0 

850.8 

809.6 

783.7 

915.8 

935.9 

957.3 

1,004.8 

1,019.1 

Feb. 


SOI. 7 

854.6 

796.7 

788.1 

930.9 

928.5 

965.6 

1,008.7 

1,015.3 

Mar. 


798.9 

858.1 

. 784.9 

793.4 

943.4 

925.5 

971.5 

1,012.8 

1,012.0 

Apr. 


800.8 

860.0 

776.8 

798.8 

951.9 

926.4 

973.7 

1,018.8 

1,006.5 

May 


802.1 

864.8 

768.8 

808.7 

956.0 

927.8 

976.3 

1,022.8 

998.3 

June 


803.2 

867.9 

760.5 

823.0 

956.1 

930.0 

980.7 

1,020.7 

991.6 

July 

860.5 

808.7 

863.0 

757.0 

835.7 

956.4 

933.1 

984.2 

1,018.9 


Aug. 

860.8 

816.2 

854.5 

759.6 

846.0 

959.1 

934.3 

986.5 

1,020.9 


Sept. 

851.9 

826.3 

844.1 

767.9 

854.2 

961.3 

934.7 

989.0 

1,023.5 


Oct. 

839.5 

833.0 

836.6 

773.6 

867.6 

958.5 

937.5 

991.8 

1,025.2 


Nov. 

827.4 

837.6 

831.3 

774.7 

885.3 

952.4 

943.1 

995.2 

1,024.8 


Dec. 

816.6 

846.1 

821.5 

778.2 

901.1 

944.7 

949.8 

999.7 

1,022.9 



The original data are now expressed as percentages of 
the corresponding moving averages. These percentages 
are given in Table 77. 


Table 77 


Percentage Relation of Actual Freight Car Loadings to 12-Month 

Moving Averages 


Month 

1918 

1919 

1920 

1921 

1922 

1923 

1924 

1925 

1926 

1927 

Jan. 


90.1 

96.4 

87.2 

88.8 

92.6 

91.8 

93.1 

91.6 

92.6 

Feb. 


85.7 

90.8 

86.0 

96.1 

91.7 

97.8 

93.8 

92.4 

94.2 

Mar. 


87.1 

98.8 

88.0 

103.1 

97.1 

99.0 

95.3 

94.8 

98.6 

Apr. 


90.0 

84.9 

90.9 

89.6 

98.9 

94.3 

95.7 

94.8 

96.3 

May 


94.6 

99.7 

98.9 

96.0 

102.0 

96.5 

99.5 

99.5 

100.6 

June 


99.1 

103.2 

100.2 

101.0 

105.8 

97.4 

101.2 

103.1 

103.0 

July 

112.7 

109.7 

104.4 

99.1 

97.3 

103.0 

94.4 

99.1 

101.8 


Aug. 

111.8 

109.3 

113.4 

106.6 

100.8 

108.6 

103.7 

108.8 

108.3 


Sept. 

112.2 

116.2 

114.6 

109.6 

108.3 

107.9 

110.9 

108.6 

111.4 


Oct. 

110.2 

116.1 

120.1 

120.5 

112.7 

111.6 

116.4 

111.6 

115.5 


Nov. 

99.0 

96.3 

106.3 

98.6 

108.1 

101.2 

103.5 

102.9 

101.7 


Dec. 

88.0 

89.6 

91.9 

87.5 

92.3 

87.5 

91.5 

92.5 

83.9 



These percentages show some variation from year to 
year in the relation of the figures for a given month to 
the moving average. Thus the January figures, while 
always below the average, vary from 87.2 per cent to 
96 . 4 per cent of the average. The nine percentages secured 
for each month must be averaged to obtain the index 
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desired. Either the arithmetic average or the median may 
be employed for this purpose. The results secured by 
applying the two methods are shown in Table 78. In 
columns (2) and (3) the actual arithmetic means and 
medians are given. The average of the twelve arithmetic 
means happens to be exactly 100, so no further adjustment 
is needed. Usually the average will depart in some degree 
from 100, as it does for the medians. When this is the 
case, the twelve monthly index numbers must be adjusted 
to make their average equal to 100. The items in column 
(4) have been secured from the items in column (3) by 
dividing throughout by 1 . 00367. 

Table 78 

Indices of Seasonal Variation in Freight Car Loadings, Computed 
from Moving Averages 


(1) 

(2) 

( 3 ) 

( 4 ) 

Month 

Arithmetic 

Medians 

Medians 

means 

(unadjusted) 

(adjusted) 

January 

91.6 

91.8 

91.5 

February 

92.1 

92.4 

92.1 

March 

95.8 

97.1 

96.7 

April 

92.8 

94.3 

94 0 

May 

98.6 

99.5 

99.1 

June 

101.6 

101.2 

100.8 

July 

102.4 

101.8 

101.4 

August 

107.9 

108.6 

108.2 

September 

111.1 

110.9 

110.5 

October 

115.0 

115.5 

115.1 

November 

101.7 

101.7 

101.3 

December 

89.4 

89 6 

89.3 

Average 

100.0 

100.367 

100.0 


The Computation of Index Numbers of Seasonal 
Variation by Averaging Ratios to Trend 

A somewhat similar method of securing seasonal indices, 
which has certain distinctive advantages, involves the 
averaging of ratios to trend. 1 In the application of this 

1 The essentials of this method were worked out independently by Helen D, 
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method, a suitable line of trend, linear or non-linear, is 
fitted to the data, the actual monthly items are expressed 
as percentages of the corresponding trend figures, and then, 
for each month, an average of the percentage ratios of 
the actual to the trend values is secured. This procedure 
is identical with that described in connection with the 
use of moving averages, except that the actual values may 
be expressed as percentages of normal values derived from 
any function employed to represent trend. In the selection 
of an average value for each month, use may be made of 
a multiple frequency table in obtaining an understanding 
of the nature of the actual seasonal movement. With the 
help of such a table the existence of a definite seasonal 
movement may be verified and the type of average to be 
used in securing a typical value for each month may be 
determined. (It would, of course, be equally appropriate 
to use a table of this type in connection with the method 
of moving averages.) We shall apply this method to the 
data employed in the preceding examples. 

A straight line, fitted to annual averages of the data of 
freight car loadings from 1918 to 1927, as given in Table 75, 
is described by the equation 

y = 769.00 + 23.727a: 

with origin at July 1, 1917. Normal values for each month 
may be computed readily. 1 The normal value for the 
month centering at July 1, 1917, is 769.00 (i.e., the constant 
a of the trend equation). Since the increment over a twelve- 
month period is 23.727, the increment from month to 
month is one twelfth of this, or, 1 . 977. Hence the normal 
value for the month centering at January 1, 1918, is 769.00 

( Footnote 1 continued from page 287.) 

Falkner "The Measurement of Seasonal Variation/' Journal of the American 
Statistical Association , June, 1924, 167-179, and Lincoln W. Hall, "Seasonal 
Variation as a Relative of Secular Trend,” Journal of the American Statistical 
Association , June, 1924, 156-166. 

1 Methods used in the determination of monthly trend values are discussed 
in Chapter VII. 
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4- (6 X 1.977), or 780.862. But the average weekly freight 
car loadings for January, 1918, must be taken to center 
at January 15th. The monthly trend value centering at 
that date is 780.862 + $(1,977), 0 r 781.850. The trend 
value for February, 1918, is secured by adding to 781.850 
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Fia. 62 . — Frequency Distributions: Monthly Freight Car Loadings 
Expressed as Relatives of Corresponding Trend Values 


the monthly increment, 1,977. A similar process gives 
the value for each succeeding month. The results, rounded 
off to the nearest whole number, are given in column (2) 
of Table 80. 

Expressing each of the given values for each month as a per- 
centage of the corresponding trend value, we secure ten 
such relative figures (since the data cover ten years). The 
ten January percentages vary from 79.4 to 98.9, the ten 
October percentages from 107 . 0 to 119.7, etc. The multiple 
frequency table which appears in Fig. 62 is constructed 
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by classifying, in the form of a frequency distribution, the 
items for each month. 

The presence of a distinct seasonal variation is dem- 
onstrated by this table. Freight traffic is consistently 
low in the winter months. Activity is somewhat greater 
in the spring, and reaches a peak as a result of harvesting 
and other demands in the late summer and fall. 

The tabular summary facilitates the selection of a type 
of average for the measurement of the seasonal movements. 
The median is likely to be unrepresentative; it is subject 
to material change in value by the addition or withdrawal 
of one or two entries, unless there is a definite concentration 
in the monthly frequency distributions. The arithmetic 
mean of all the items, on the other hand, may be unduly 
affected by exceptional cases. An alternative method is 
provided by the possibility of taking the arithmetic mean 
of the central items for each month. If an inspection of 
the multiple frequency table does not lead to an immediate 
decision as to which is the best type of average to employ 
in a given case, several index numbers may be worked 
out for each month, and a decision reached after a compari- 
son of the results. (Indeed, since the determination of a 
typical value is a separate problem for each month, the 
method of averaging employed might vary from month 
to month.) In the present instance the seasonal variation 
is fairly regular, year after year. No great differences 
would appear in the results secured by averaging varying 
numbers of items. Index numbers based upon averages 
of the four central items for each of the twelve months 
are appropriate in this case. (In general, an average of 
three, four, or five central values is more likely to be stable 
and representative than either the median or an average 
of all the items for each month. The greater the concentra- 
tion in the monthly frequency tables, the smaller the number 
of items upon which the index numbers may be based.) 

The simple averages of the four central items constitute 
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the unadjusted index numbers given in Table 79. Correcting 
these so that the average of each group is equal to 100, 
we secure the adjusted index numbers presented in the 
same table. (These averages have been derived, not from 
the frequency distributions shown in Fig. 62, but from 
individual percentages defining the relation of actual to 
trend values.) 


Table 79 


Indices of Seasonal Variation in Freight Car Loadings, Based 
upon Percentage Ratios of Actual Values to Linear 
Trend Values 



Unadjusted 

Adjusted 


index numbers 

index numbers 

Month 

(based upon four 

(based upon four 


central items) 

central items) 

January 

92.9 

91.6 

February 

94.8 

93.5 

March 

98.6 

97.2 

April 

94.3 

93.0 

May 

100.2 

98.8 

June 

102.4 

101.0 

July 

102.8 

101.3 

August 

109.7 

108.2 

September 

112.3 

110.7 

October 

115.6 

114.0 

November 

103.6 

102.1 

December 

89.9 

88.6 

Average 

101.425 

100.0 


The index numbers of seasonal variation derived from 
ratios to trend accord very closely with those computed 
from moving averages. The widest discrepancy, for the 
month of February, amounts to only 1 . 4. The consistency 
of the seasonal movement in freight car loadings helps 
to explain this close agreement. In general the two methods 
here exemplified will yield results that are fairly close 
together. Both are well adapted to the measurement of 
seasonal changes in homogeneous series. Simpler methods 
may be used on occasion, and more involved methods may 
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be required in dealing with non-homogeneous series where 
there is reason to suspect that the pattern of seasonal 
movements has been modified during the period under 
observation. 

Modifications of these general procedures are necessary 
when the pattern of the seasonal movements in a given 
series is altered dining the period under observation. Two 
types of shifts in seasonal variation may be distinguished. 
The first includes shifts that are irregular over time, but 
that are related to definable causal factors. Thus the price 
of an agricultural product may follow one seasonal pattern 
in years of high production, and quite a different pattern 
in years of low production. 1 Where this condition prevails 
it may be possible to compute two sets of seasonal indices, 
each to be applied under appropriate conditions. Methods 
already described may be used in the construction of such 
indices. Of this irregular type, also, are alterations in 
the seasonal pattern of an economic series that reflect sharp 
changes in business practice. Shifts in the dates of the 
annual automobile shows in the United States have mater- 
ially altered the seasonal index of automobile sales. 

The second type of seasonal modification is progressive 
in character. The change in pattern is not sudden, nor 
does it reverse itself. Slow alterations over time in trade 
practices and consumption habits bring such evolutionary 
or secular changes. The slow displacement of the open 
car by the closed car brought such a progressive modification 
in the seasonal pattern of automobile sales. In the computa- 
tion of seasonal indices under these conditions persistent 
changes over time in the figures for each month may be 
measured separately. Thus, when ratios to trend have 
been obtained, all the January items (such as those plotted 
in Fig. 62) may be plotted chronologically. The progressive 
change in the January relatives from 1920 to 1937, say, 
is then defined by a line of secular trend. The trend value 
1 See F. L. Thompson, Agricultural Prices , New York, McGraw-Hill, 1936. 
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for January of 1920 is a first approximation to the January 
seasonal index for 1920. The figure for February of 1920 
is obtained in the same way, and so for each month of 
1920. Adjustment of these preliminary values to make 
their average equal to 100 gives a set of seasonal indices 
for 1920. Seasonal indices for other years are computed 
in the same way. 

This method is, of course, more laborious than the pro- 
cedure followed when the seasonal pattern remains con- 
stant. Before applying the more complicated method the 
investigator should assure himself that the shift in pat- 
tern is real, and not merely a reflection of accidental varia- 
tions. 1 

The Measurement of Cyclical Fluctuations 

There remains the task of combining the corrections for 
secular trend and seasonal variation in order to secure 
measures of cyclical changes in a given series. Major 
interest in most economic studies attaches to these cyclical 
changes, and the measurement of such changes is usually 
the central problem in the analysis of time series. The 
complete elimination of all non-cyclical movements is impos- 
sible, of course. We must content ourselves with measures 
reflecting cyclical changes intermingled in rather uncertain 
proportions with accidental fluctuations. 

The procedure may be illustrated with reference to the 
data of freight car loadings in the United States, presented 
in Table 75. For the purposes of the present illustration 
the study will be restricted to the decade 1918-1927. The 

1 Tests of sampling errors are discussed in Chapters XIV, XV, and XVIII. 
The test of a linear trend in this case would relate to the slope b of the line 
fitted to the relatives for a given month. 

The literature on the measurement of seasonal fluctuations is extensive. 
The references at the close of this chapter contain detailed accounts of various 
modifications of the basic procedures discussed above. A rapid, flexible 
and accurate graphic method, suitable for use by the student who has grasped 
the essentials of the formal procedures, is explained in the article by William A. 
Spurr. Spurr’s method utilizes relative (logarithmic) deviations, a procedure 
for which there is strong logical justification. 
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severe disturbances that occurred during the business cycle 
that ran its course between 1927 and 1933, and in the years 
immediately following, greatly complicate the task of disen- 
tangling the secular, seasonal, and cyclical elements in the 
behavior of this series. Not until a somewhat longer period 
has intervened will it be possible to determine the contribu- 
tions a chan g in g secular trend and changing seasonal move- 
ments may have made to the fluctuations in railway freight 
traffic during the decade 1927-1937. 

In attempting to separate the results of secular, seasonal, 
cyclical, and random movements in the behavior of time 
series, it is well to establish a series of “expected” values, 
representing results of the operation of regularly acting 
forces. Most regular and predictable of the forces affecting 
time series are those defined as secular and seasonal. The 
equation to the line of secular trend of freight car loadings 
provides a means of estimating annual and monthly values. 
These would be the “expected” values were the forces of 
trend alone in operation. But we know that a seasonal 
movement, regular enough for fairly exact measurement, is 
superimposed upon the trend. The combination of the 
results of these two forces provides a basic series of “ expected 
values,” from which deviations due to the play of other 
forces may conveniently be measured. 

A process suitable to this purpose is illustrated in Table 80. 
In col. (2) we have the monthly trend values of freight 
car loadings, and in col. (3) index numbers of seasonal 
variation. The products of the two, constituting the series 
of “expected values,” are given in col. (4). Thus, for Janu- 
ary, 1918, the expected number of freight cars loaded is 
not 782, the trend value, but 782 X . 916, the latter figure 
•being the seasonal index for January. This correction 
gives an “expected” number of 716. Subtracting from the 
actual values in col. (5) the corresponding expected values, 
we obtain the measurements in col. (6). The 655 cars 
loaded in January, 1918. fell short by 61 of the “expected” 
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Table 80 


Illustrating the Analysis of a Series in Time 
Freight Car Loadings, 1918-1927 
(1) (2) (3) (4) (5) (6) 

Trend Deviation of ‘ 

Year m j Seasonal . , A . 7 actual value 

! Trend . , corrected Actual , ^ c 

and i index , 7 from trend 

—a <«»*•> Jl* i 

seasonal 


1918 
Jan. 
Feb. 
Mar. 
Apr. 
May 
June 
July 
Aug. 
Sept. 
Oct. 
Nov. 
Dec. 

1919 
Jan. 
Feb. 
Mar. 
Apr. 
May 
June 
July 
Aug. 
Sept. 
Oct. 
Nov. 
Dec. 

1920 
Jan. 
Feb. 
Mar. 
Apr. 
May 
June 




( 7 ) 

Percentage 
deviation of 
actual value 
from l trend 
corrected for 
seasonal * 


A -TS 

A — TS 

TS 

- 61 

- 8.5 

+ 20 

+ 2.7 

+ 78 

+ 10.2 

+ 140 

+ 19.1 

+ 116 

+ 14.9 

+ 118 

+ 14.8 

+ 166 

+ 20.6 

+ 101 

+ 11.7 

+ 73 

+ 8.3 

+ 13 

+ 1-4 

0 

0 

+ 7 

+ 0.98 

- 10 

- 1.4 

- 68 

- 9.0 

- 91 

- 11.6 

- 34 

- 4.5 

- 44 

- 5.5 

- 27 

- 3.3 

+ 59 

+ 7.1 

+ 6 

+ 0.7 

+ 51 

+ 5.6 

+ 29 

+ 3.1 

- 35 

- 4.2 

+ 25 

+ 3.4 

+ 61 

+ 8.0 

- 1 

- 0.1 

+ 38 

+ 4.7 

- 47 

- 6.0 

+ 35 

+ 4.2 

+ 49 

+ 5.8 
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Table 80 — Continued 


( 1 ) 


1920 
July 
Aug. 
Sept. 
Oct. 
Nov. 
Dec. 

1921 
Jan. 
Feb. 
Mar. 
Apr. 
May 
June 
July 
Aug. 
Sept. 
Oct. 
Nov. 
Dec. 

1922 
Jan. 
Feb. 
Mar. 
Apr. 
May 
June 
July 
Aug. 
Sept. 
Oct. 
Nov. 
Dec. 

1923 
Jan. 
Feb. 
Mar. 
Apr. 
May 


Illustrating the Analysis of a Series in Time 


(2) 

(3) 

(4) 

(5) 

(6) 

(7) 

A — TS 

T 

S 

TS 

A 

A - TS 

TS 

841 

1.013 

852 

901 

+ 49 

+ 5.8 

843 

1.082 

912 

969 

+ 57 

+ 6.3 

845 

1.107 

935 

967 

+ 32 

+ 3.4 

847 

1.140 

966 

1,005 

+ 39 

+ 4.0 

849 

1.021 

867 

884 

+ 17 

+ 2.0 

851 

.886 

754 

755 

+ 1 

+ 0.1 

853 

.916 

781 

706 

- 75 

- 9.6 

855 

.935 

799 

685 

- 114 

- 14.3 

857 

.972 

833 

691 

- 142 

- 17.0 

859 

.930 

799 

706 

- 93 

- 11.6 

861 

.988 

851 

760 

- 91 

- 10.7 

863 

1.010 

872 

762 

- 110 

- 12.6 

865 

1.013 

876 

750 

- 126 

- 14 4 

867 

1.082 

938 

810 

- 128 

- 13.6 

869 

1.107 

962 

842 

- 120 

- 12.5 

871 

1.140 

993 

932 

- 61 

- 6.1 

873 

1.021 

891 

764 

- 127 

- 14.3 

875 

.886 

775 

681 

- 94 

- 12.1 

877 

.916 

803 

696 

- 107 

- 13.3 

879 

.935 

822 

757 

- 65 

- 7.9 

881 

.972 

856 

818 

- 38 

- 4.4 

883 

.930 

821 

716 

- 105 

- 12.8 

885 

.988 

874 

776 

- 98 

- 11.2 

887 

1.010 

896 

831 

- 65 

- 7.3 

889 

1.013 

901 

813 

- 88 

- 9.8 

891 

1 082 

964 

853 

- Ill 

- 11.5 

893 

1.107 

989 

925 

- 64 

- 6.5 

895 

1.140 

1,020 

978 

- 42 

- 4.1 

897 

1.021 

916 

957 

+ 41 

+ 4.5 

899 

.886 

797 

832 

+ 35 

+ 4.4 

900 

.916 

824 

848 

+ 24 

+ 2.9 

902 

.935 

843 

854 

+ 11 

+ 1.3 

904 

.972 

879 

916 

4* 37 

+ 4.2 

906 

.930 

843 

941 

+ 98 

+ 11.6 

908 

.988 

897 

975 

+ 78 

+ 8.7 
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Table 80 — Continued 

Illustrating the Analysis of a Series in Time 

(1) (2) (3) (4) (5) (6) . (7) 

A TS 

T S TS j A A -TS TS - 

1923 

June 910 1.010 919 1,012 + 93 + 10.1 

July 912 1.013 924 985 + 61 + 6.6 

Aug. 914 1.082 989 1,042 + 53 + 5.4 

Sept. 910 1.107 1,014 1,037 + 23 + 2.3 

Oct. 918 1.140 1,047 1,070 + 23 + 2.2 

Nov. 920 1.021 939 964 + 25 + 2 7 

Dec. 922 .886 817 827 + 10 + 1.2 

1924 

Jan. 924 . 916 846 859 + 13 + 1.5 

Feb. 926 .935 866 908 + 42 + 4.8 

Mar. 928 .972 902 916 + 14 + 1.6 

Apr. 930 .930 865 874 + 9 + 1.0 

May 932 .988 921 895 - 26 - 2.8 

June 934 1.010 943 906 - 37 - 3.9 

July 936 1.013 948 881 - 67 - 7.1 

Aug. 938 1.082 1,015 969 - 46 - 4.5 

Sept. 940 1.107 1,041 1,037 - 4 - 0.4 

Oct. 942 1.140 1,074 1,091 + 17 + 1.6 

Nov. 944 1.021 964 976 + 12 + 1.2 

Dec. 946 .886 838 869 + 31 + 3.7 

1925 

Jan. 948 . 916 868 891 + 23 + 2.6 

Feb. 950 .935 888 906 + 18 + 2.0 

Mar. 952 . 972 925 926 + 1 + 0.1 

Apr. 954 . 930 887 932 + 45 + 5.1 

May 956 .988 945 971 + 26 + 2.8 

June 958 1.010 968 992 + 24 + 2.5 

July 960 1.013 972 975 + 3 + 0.3 

Aug. 962 1.082 1,041 1,073 + 32 + 3.1 

Sept. 964 1.107 1,067 1,074 + 7 + 0.7 

Oct. 966 1.140 1,101 1,107 + 6 + 0.5 

Nov. 968 1.021 988 1,024 + 36 + 3.6 

Dec. 970 . 886 859 925 + 66 + 7.7 

1926 

Jan. 972 . 916 890 920 + 30 + 3.4 

Feb. 974 .935 911 932 + 21 + 2.3 

Mar. 976 . 972 949 960 + 11 + 1.2 

Apr. 978 . 930 910 966 + 56 + 6.2 

May 980 . 988 968 | 1,018 1+50 + 5.2 
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Table 80 — Continued 


Illustrating the Analysis of a Series in Time 


(1) 

(2) 

T 

(3) 

S 

(4) 

TS 

(5) | 
A 

(6) 

A — TS 

A_ 

(7) 

- TS 
TS 

1926 








June 

982 

1.010 

992 

1,052 

+ 

60 

+ 

6.0 

July 

984 ; 

1.013 

997 

1,037 

+ 

40 

+ 

4.0 

Aug. 

986 

1.082 

1,067 

1,106 

+ 

39 

+ 

3.7 

Sept. 

987 

1.107 

1,093 

1,140 

+ 

47 

+ 

4.3 

Oct. 

989 

1.140 

1,127 

1,184 

+ 

57 

+ 

5.1 

Nov. 

991 

1.021 

1,012 

1,042 

+ 

30 

+ 

3.0 

Dec. 

993 

.886 

880 

858 

— 

22 

— 

2.5 

1927 









Jan. 

995 

.916 

911 

944 

+ 

33 

+ 

3.6 

Feb. 

997 

.935 

932 

956 

+ 

24 

+ 

2.6 

Mar. 

999 

.972 

971 

998 

+ 

27 

+ 

2.8 

Apr. 

1,001 

.930 

931 

969 

+ 

38 

+ 

4.1 

May 

1,003 

.988 

991 

1,004 

+ 

13 

+ 

1.3 

June 

1,005 

1.010 

1,015 

1,021 

+ 

6 

+ 

0.6 

July 

1,007 

1.013 

1,020 

978 

— 

42 

— 

4.1 

Aug. 

1,009 

1.082 

1,092 

1,073 

— 

19 

— 

1.7 

Sept. 

1,011 

1.107 

1,119 

1,093 

— 

26 

— 

2.3 

Oct. 

1,013 

1.140 

1,155 

1,101 


54 

— 

4.7 

Nov. 

1,015 

1 021 

1,036 

926 

j — 

110 

— 

10.6 

Dec. 

1,017 

.886 

901 

814 


87 

- 

9.7 


number, 716. Such deviations of actual values from "trend 
corrected for seasonal” represent the combined influence 
of cyclical and accidental factors. These may be utilized 
in the absolute form given in col. (6), or may be expressed 
in percentage terms as in col. (7) of Table 80. 

The series defining trend values corrected for seasonal 
variations, which are given in cols. (6) and (7) of Table 80, 
furnish the most satisfactory bases from which cycles in 
economic series may be measured. It is true that the 
"cycles” in cols. (6) and (7) are distorted by accidental 
fluctuations, but there is no simple means by which these 
may be eliminated. Recognizing their presence, the series 
may be put to fruitful use in the study of cyclical movements. 1 

1 A series cf “corrected deviations from trend ’’ may be secured by subtract- 
ing the indices oi seasonal variation from a series in which actual values are 
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The analysis of this series may be followed through 
graphically in Figs. 63 and 64 on page 300. The actual data of 
freight car loadings, by months from 1918 to 1927, are plotted 
in Fig. 63, together with a straight line of trend. In 
addition, a series of expected values (the figures in col. [4] 
of Table 80) is given for comparison with the actual. In 
this chart the seasonal pattern, shown by the dotted line, is 
superimposed upon the trend. Fig. 64 shows the deviations 
of actual from expected values, in percentage terms. These 
constitute the “cycles” in freight car loadings. As we have 
noted, random elements as well as cyclical fluctuations proper 
are present in these deviations. It would be possible, by 
using three- or five-month moving averages on these devia- 
tions, or by other smoothing processes, to eliminate some of 
the effects of the accidental movements. But the random and 
the cyclical movements are so closely interwoven that the 
attempt at separation is not generally made. 

If cyclical changes in this series are to be compared with 
similar changes in other series, it is desirable to reduce the fig- 
ures to a form permitting such comparison. The percentage 
deviations might be much more violent in one series than in 
another, and without a common denominator comparison 
would be difficult. This common denominator is afforded by 
the standard deviation. The monthly or annual deviations 
may be expressed in terms of the standard deviation as the 
unit of measurement, if such comparison is to be made. 


(Footnote 1 continued from page 298.) 

A 

given as percentages of corresponding trend values. That is, — — S may be 


employed, instead of 


TS 


TS 


This usage, which involves the assumptions 


that the “cyclical-accidental” composite and seasonal variations both repre- 
sent deviations from trend as base and that their influences are additive, is 
not as strong, logically, as the method exemplified in the text. Trend and 
seasonal forces are the constant factors in the behavior of time series. In 
combination they may be thought of as providing the base from which cycli- 
cal and accidental movements occur, as deviations. (This is a convenient, and 
perhaps not a faulty, conception. We do not, however, possess knowledge of 
the true organic relations among the elements of time series.) 
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The process of analysis has now been completed. We 
have, for the given series, the equation to the line of secular 
trend, and from this the normal or trend value at any 
given date may be computed. The seasonal variations have 
been measured, and indices of these variations computed. 
Finally, the cyclical fluctuations (plus the unmeasurable 
random and accidental changes) have been isolated. These 
measurements of cyclical fluctuations may be used in 
studying the sequences of change in different economic 
series during business cycles, in comparing economic series 
in respect of the amplitude or duration of their cyclical 
movements, and in various other ways in the analysis of 
business cycles and the planning of business operations. 
Some of these applications are discussed in later sections. 

General Considerations 

Certain considerations not specifically mentioned above 
should be borne in mind in subjecting time series to the 
type of analysis described in this chapter. It is essential 
that the data employed be homogeneous, as regards sources, 
methods of quotation, coverage, etc. In addition, homo- 
geneity in the conditions underlying the behavior of the 
particular series which are the objects of study is assumed. 
Homogeneity, as the term is here used, may not be defined 
in absolute terms. New factors are constantly being inter- 
jected into economic and social life. Homogeneity cannot 
be taken to mean static conditions. Yet the change must 
be orderly and, as regards major movements, reasonably 
continuous if the kind of analysis here discussed is to yield 
results. Abrupt dislocations that suddenly alter prevailing 
trends and existing seasonal patterns break the necessary 
homogeneity of statistical series. If the forces that caused 
these dislocations persist, and operate in orderly fashion, 
we mark a break in our series and subject the new period 
to analysis in its turn. 

For the determination of a line of trend and the calcula- 
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tion of indices of seasonal variation, data extending over 
as long a period as possible should be employed (subject 
to the preceding qualification regarding fundamental dis- 
continuities). Ten years may be suggested as a minimum 
period, though a much longer term of years is desirable. 
If interest attaches to cycles of long duration, rather than 
to the short-period business cycles with which the preceding 
account is concerned, our concept of trend, as well as that 
of cycles, must be modified. The minimum time period 
suitable for study must be correspondingly lengthened. 

If a relatively short term of years is employed in the 
determination of trend, it is important that the terminal 
years be neither exceptionally high nor exceptionally low, 
as a result of cyclical or accidental movements. In general, 
the cyclical movements in the terminal years should be 
in “symmetrical phases,” in Crum’s phrase. Thus a cyclical 
rise at the beginning of the period should be balanced by 
a cyclical decline at the end. 

It is logically improper to make correction for assumed 
seasonal movements in a time series unless the existence 
of true seasonal variations has been established. That is, 
a test should be applied to determine whether the observed 
departures of the various monthly indices from their aver- 
age value (100) are attributable to the play of chance, or 
whether a true seasonal pattern is present. The basis of 
such tests of significance is discussed in Chapters XIV 
and XVIII, and a method appropriate to the present problem 
is developed in Chapter XV. 

In fitting a line of trend, computing indices of seasonal 
variation and deriving, finally, a set of residual figures 
which are taken to reflect the cyclical fluctuations in an 
economic series we are, of course, abstracting from reality. 
As in all such abstractions, caution is necessary. Assump- 
tions implicit in the various steps taken are likely to be 
forgotten. Thus the “cycles” plotted as deviations in Fig. 64 
are distorted not only by the random and irregular fluctua- 
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tions to which attention has already been called. To the 
extent that the trend is inadequately or inaccurately defined 
by the particular function used, residual errors are present 
in the deviations. To the extent that seasonal movements 
are inaccurately measured by the seasonal indices employed, 
other residual errors are present. And if the trend is pro- 
jected beyond the period covered by the fitting process, 
or if seasonal indices are used for periods not included in 
their calculation, new sources of possible error are intro- 
duced. The “cycles” that appear so definite and clear-cut 
in our tables may contain more than traces of many non- 
cyclical elements. It is often desirable to employ methods 
of analysis that carry us far from the original observations, 
but the dangers of misinterpretation and error are multiplied 
as we abstract from the reality of economic processes and 
business operations. 

The methods of time series analysis described in this 
and the preceding chapter are adapted to a variety of 
economic and business purposes. But they do not constitute 
the only means of attack, in dealing with series ordered 
in time. Special problems may necessitate the use of some- 
what more elaborate procedures. 1 For some purposes simpler 
methods will suffice. For other purposes it may be invalid 
to attempt to isolate and measure separately the influence 
of secular, seasonal, and cyclical forces. Economic science 
has yet to determine the precise nature of the interrelations 
among these categories of forces. In the light of this fact 
the discerning statistician will adapt his methods to the 
requirements of individual problems, as they arise. 
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CHAPTER IX 


INDEX NUMBERS OF PHYSICAL VOLUME 

Comprehensive and accurate records of physical pro- 
duction are of central importance to business interests, to 
government, and to economists. The appraisal of the mar- 
ket and the intelligent planning of production programs 
require knowledge of past production trends and present 
conditions. The credit policies of banking authorities and 
monetary policies of federal agencies are determined in 
good part with reference to the physical volume of goods 
being produced and marketed. The phases of business 
cycles are probably traced with more accuracy by produc- 
tion movements than by changes in any other economic 
element. The directions in which the productive efforts 
of an economy are being exerted are defined by records of 
the output of goods of different classes, such as capital 
goods and consumption goods. Changes in the course of 
years in the true standard of living of a nation must be 
measured in terms of the aggregate of physical goods pro- 
duced. 

The last twenty years have witnessed notable enlarge- 
ments of the scope and improvements in the accuracy of 
measurements of production in the United States. Efforts 
of federal agencies, private organizations, and trade associa- 
tions have combined to provide materially better statistics 
of output in agriculture, mining, and manufacture. More 
recently records of the volume of trade have been broadened 
and made more accurate. There are important gaps still, 
particularly as regards the output of finished, highly fabri- 
cated goods not easily enumerated in units of constant 
quality. But the statistics we have provide a full and 
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reasonably accurate record of monthly and annual move- 
ments of production. 

Here, again, we face the problem of combining series 
relating to individual commodities. For scattered data on 
the output of oats, coal, gasoline, pig iron, automobiles, 
etc. do not define the general changes in output that are 
of interest to persons concerned with the larger aspects 
of economic change. He who would study the course of 
general production encounters a problem much like that 
presented to the student of general price movements. If 
the general trend of production is to be determined, or 
if the cyclical or seasonal swings of production are to be 
studied, the mass of individual figures must be reduced to 
the form of a single index, the significance of which may 
be easily comprehended. The present chapter deals with 
methods appropriate to the construction of such indices. 

Index Numbers of Production Unadjusted for Trend 
and Seasonal Movements 

An immediate and obvious obstacle to the combination 
of measures of output for different industries arises from 
differences in the units employed. Since bushels, tons, and 
gallons may not be added directly, the simple aggregative 
type of index is ruled out. One method of overcoming this 
difficulty is to reduce to relative terms the several output 
series that are to be combined. A relative number measuring 
the output of petroleum in 1936 as a percentage of output 
in 1922 may be averaged with similar relatives for bituminous 
and anthracite coal. The average may be a simple one, 
or the relatives for the several commodities may be weighted 
in proportion to the importance of the commodities in 
question. This procedure was illustrated in detail in the 
opening pages of Chapter VI. 

An alternative method is to employ an index of the 
weighted aggregative type, keeping prices constant as be- 
tween two periods being compared. In 1917, according 
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to the computations of the Price Section of the War Indus- 
tries Board, the total value of the output of 90 raw materials 
in the United States was 34,748 millions of dollars. This 
figure represents, of course, a value total of the type 
2 (gm 7 Pi 9 i 7 ) where qm 7 represents the quantity of a giver 
raw material produced in 1917 and Pim represents the 
average price of that commodity in 1917. In 1918 both 
quantities and prices were different. If, however, we obtain 
another value aggregate using 1918 quantities and 1917 
prices we shall have a figure differing from that for 1917 
only in respect of the quantity factor. For the 90 raw mate- 
rials in question this total, which is represented by the 
expression ^(qmsPiw) amounted to $35,109,000,000. The 
totals for 1918 and 1917 are comparable, being both in 
dollar units. The difference between them measures the 
change in physical production between the two years. As 
an index of this change we have 

l — _ $35,16 9 _ 2 

2(<7i9i7pi9i7) $34,748 

This index will be recognized as one of the aggregative 
types discussed in Chapter YI, except that the p’s and the 
q’s are interchanged. When information concerning both 
quantities produced and average per unit prices is available, 
these aggregative indices, or the "ideal” index which is 
a combination of two such aggregative measures, may be 
employed for the measurement of quantity changes as well 
as for price changes. The "ideal” index, when used for 
this purpose, takes the form 

T = . A(gipo) S(gipi) 

y 2(?oPo) S(goPi) 

where q<> and p 0 represent the quantities and prices of the 
individual commodities in the base year, while and pi 
represent quantities and prices in the given year. The 
procedure in the computation of such an index is identical 
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with that employed in computing the “ ideal ” price index, 
with prices and quantities reversed. This formula may be 
modified, as was the corresponding price index, to 

2(po Hh Pi)Qi 
2(p<j + Pi)Qo 

or to a form in which the p’s come from some intermediate 
year. In one form or another, the aggregative type of 



Fig. 65. — ■ Changes in the Physical Volume of Manufacturing Production 
in the United States, 1914-1935. All Commodities, Capital Goods and 
Consumption Goods 

index is well adapted to the requirements of an index of 
physical volume. 1 

The aggregative procedure lends itself readily to the con- 

1 Since the price or value factor enters in the derivation of such an index, 
whether it be constructed from relative numbers or from value aggregates, 
no quantity index is completely divorced from pecuniary measurements. For 
a discussion of this point, and of other logical problems involved in the con- 
struction of index numbers of production, see Arthur F. Burns, “The Measure- 
ment of the Physical Volume of Production,” Quarterly Journal of Economics, 
February, 1930. 
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struction of index numbers for commodity groups. This 
is desirable in the study of production movements, as it 
is for prices. The significant features of production changes 
over a given period may be far more clearly revealed by 
measurements of relative changes in the output of different 
classes of goods than by a general index of production. 

Changes in the volume of production of various classes 
of manufactured goods during the period 1914-1935 are 
indicated by the following measurements, constructed by 
the National Bureau of Economic Research. The basic 
data, which were compiled by the Census of Manufactures, 
provide the quantity and (by derivation) the unit price 
records required for the “ideal” formula. That formula, 
slightly modified for working purposes, was employed in 
the construction of these index numbers. 


Table 81 


Index Numbers of the Physical Volume of Production of 
Manufactured Goods in the United States, 1914 - 1935 1 



All 

industries 

Durable 

goods 

Scmi- 

durable 

goods 

Perish- 

able 

goods 

1914 

100.0 

100.0 

100.0 

100.0 

1919 

129.5 

141.7 

120.9 

123 2 

1921 

104.5 

99.6 

104.6 

108 9 

1923 

155.8 

183.7 

140.2 

135.4 

1925 

159.5 

185.2 

141.8 

144.4 

1927 

163.3 

177.2 

151.0 

154.9 

1929 

183.7 

210.9 

162.5 

170.9 

1931 

138.2 

112.3 

137.4 

154.9 

1933 

128.0 

91.4 

140.1 

144.4 

1935 

160.5 

143.9 

164.4 

163.9 


Goods 
destined 
for human 
consump- 
tion 

100.0 

129.1 

109.4 

150.4 

154.0 

159.5 
177.7 
146 9 

142.6 

171.3 


Goods 
destined 
for capital 
equipment 
and con- 
struction 

100.0 

129.5 
91.8 

164.3 

167.7 

166.7 

192.0 

103.7 

81.3 

122.5 


Selected measurements from Table 81 are shown graphi- 


cally in Fig. 65. 


1 Constructed by the National Bureau of Economic Research, New York. 
See Economic Tendencies in the United States for a statement on procedure. 
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Adjusted Index Numbers of Production 

In the analysis of time series we have seen that cyclical 
fluctuations are often the objects of primary interest. This 
is particularly true in the study of physical volume, for 
changes in the volume of production and trade are features 
of fundamental importance in business cycles. Methods 
have been explained, in the preceding chapters, by means 
of which we seek to measure the cyclical fluctuations in 
individual series (fluctuations which are inextricably entan- 
gled with accidental movements of major and minor degree). 
An obvious next step, in the study of general business 
conditions, is the combination of the cyclical-accidental 
movements in a number of series into a single index. The 
utility of such an index of changes in the physical volume 
of production in the course of the business cycle is evident. 

When annual data are employed the construction of an 
index of these cyclical changes is simple. No problem of 
seasonal variation enters, and secular trend alone has to 
be taken account of. Two different methods by which 
this may be done present themselves. Edmund E. Day, 
a pioneer in this field of economic research, has tested both 
methods. 

The first involves the fitting of an appropriate line of 
trend to each of the constituent series. The actual items 
are then expressed as percentages of the corresponding 
trend values. When this has been done for each series, 
the final adjusted index for a given year is obtained by 
taking a weighted average of these percentages for that 
year. Each commodity may be weighted in this averaging 
process, as in the calculation of the unadjusted index. 
The resulting adjusted index is in terms of relatives, but 
these relatives refer to a hypothetical “normal,” instead 
of to any fixed base. This is the desired index of cyclical- 
accidental changes in the physical volume of production. 
With monthly data the process is the same, except that.. 
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before being averaged, the deviations from trend are cor- 
rected to eliminate the influence of recurrent seasonal 
movements. 

In the process of averaging deviations from trend, account 
should be taken of the relative variability of the series 
being combined. As an example, we may consider the 
combination of data of pig iron production and cattle 
receipts in a general index of production. Reducing pig 
iron production to terms of “seasonably adjusted deviations 
from trend,” we obtain a series marked by rather extreme 
fluctuations. The standard deviation of this adjusted 
series, for a given period, was 27. For cattle receipts, cor- 
respondingly adjusted, the standard deviation was 11. In 
any combination of the two series of percentage deviations 
the more widely fluctuating pig iron measurements will exer- 
cise a dominant influence, unless correction is made. The use 
of weights defining the relative economic importance of the 
two series will not prevent distortion due to the greater 
variability of the pig iron series. 

One way out of the difficulty is to divide the deviations 
from trend by the respective standard deviations, before 
averaging. This gives an index in standard deviation units. 
Another procedure involves the combination of the “eco- 
nomic weight” and the standard deviation of each series 
in a weighting factor to be applied directly to the percentage 
deviations from trend. The economic weight is divided 
by the corresponding standard deviation, in making the 
combination. The method is illustrated below. 


Series 

Pig iron production 
Cattle receipts 


Economic Standard 

weight deviation 

20 27 

4 11 


Economic weight 
-5- standard deviation 

. 747 
.363 


The final weighting factors are the figures given in the last 
column. These may, of course, be rounded off when a 
number of series are to be averaged. 1 


1 This useful method of combining economic weights and standard devia 
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The alternative method of combining economic series is 
simpler. Each unadjusted index possesses a trend which is 
“a composite of the persistent tendencies of the several 
original series upon which the unadjusted index is based.” 
It is possible to measure this trend, instead of the separate 
original trends, and secure the adjusted index directly from 
the unadjusted. Day’s results indicate that there is no 
loss of accuracy in the use of the simpler method. 

AN INDEX OF INDUSTRIAL ACTIVITY 

This procedure, with certain modifications, is well exempli- 
fied in an “Index of Industrial Activity in the United 
States,” constructed by the Chief Statistician’s Division 
of the American Telephone and Telegraph Company. 1 The 
elements of this index are monthly data; seasonal corrections 
are therefore necessary. When these corrections have been 
made a general index measuring long-term growth and 
cyclical-accidental fluctuations, in combination, is con- 
structed by averaging 11 series, with appropriate weights. 2 
This index is shown for the period 1899-1937, with line 
of trend, in Fig. 66. The trend line was fitted by least 
squares to data for the period 1899-1930, with the war 
years, 1917-1918, omitted. 

When each monthly value of the index is expressed as 
a percentage deviation from the corresponding trend value, 
the measurements presented in Table 82, and graphically 
portrayed in Fig. 67, are obtained. The cyclical-accidental 

tions has been employed by G. W. Starr, Director of the Bureau of Business 
Research of Indiana University. I am indebted to him for the example. 

1 This index has been constructed for the use of the staffs of the Bell system 
companies, and is not available for distribution. It is published here by cour- 
tesy of the American Telephone and Telegraph Company. 

2 The following series were used for the later years of the period covered : 
steel ingot production, pig iron production, automobile passenger car produc- 
tion, building contracts awarded (on a square foot basis), cotton consumption, 
wool consumption, slaughter of cattle and hogs, newsprint consumption, mis- 
cellaneous freight car loadings, electric power consumption, and employment 
in manufacturing industries. Since employment is included, the index goes 
slightly beyond the field of strict physical production. It is intended to be 
an index of industrial activity. 
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fluctuations in industrial activity, as represented by the 
11 series employed, are traced by the movements of this 
index. 

Index op Industrial Production of the Board op 
Governors of the Federal Reserve System 

A comprehensive monthly index of production in mining 
and manufacturing industries of the United States is con- 
structed by the Division of Research and Statistics of the 
Board of Governors of the Federal Reserve System. This 
index is designed to serve current needs. In the selection 
of its components emphasis has been placed upon the 
promptness with which basic data on the output of industrial 
commodities become available, as well as upon their accuracy 
and representativeness. 

The chief points of general interest relating to this index 
may be briefly noted. 

Coverage. The index is derived from 60 individual series, 
measuring production in some 35 industries. Approximately 
80 per cent of the total industrial production of the United 
States is represented directly or indirectly in the index. 

Base period. The base of the published relatives is daily 
average production during the three years 1923, 1924, and 
1925. The final indices appear as relatives on this base, 
both with and without seasonal correction. 

Character of data used. For each commodity production 
is computed in terms of average output per working day. 
By this method distortion due to changes from one month 
to the next in the number of Sundays and holidays included 
is avoided. 

Form of index number. The index is of the weighted 
aggregative type. Original quantity figures are multiplied 
by weighting factors which convert them into common 
units (i.e., values, in dollars). In deriving the final index, 
the aggregate for a given date is expressed as a percentage 
of the base-period aggregate. 
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Weighting factors. For mineral products the weight for 
each commodity is its average per unit value in the base 
period. For manufactured products the weight for a given 
commodity is the per unit “value added” (i.e., added by 
manufacture), modified to the extent that the commodity 
in question is taken to represent other manufactured prod- 
ucts not directly included in the index. These “weights” 

thus correspond to p’s in the aggregative formula , 

2(?oPo) 

except that for a manufactured product the p is a “price” 



Fig. 68. — Physical Volume of Industrial Production in the United States, 
1919-1937 (1923-1925 average = 100) 


for the services of agents of fabrication, with a modification 
to allow the given commodity to represent similar products 
for which quantity data are not available. The weights 
for manufactured goods were drawn from the Census of 
Manufactures for 1923. The p 0 used to weight the q for 
manufactures is thus not strictly a base-period price. 1 

Adjustment for seasonal variation. No correction for trend 
is made, but in one form of the index an adjustment is 
made to eliminate the effect of seasonal fluctuations in the 


'Weighting factors were modified for the period 1919-1922 by the com- 
bination of weights for 1919 with those for the base period. 
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Table 83 


Index of Industrial Production, Board of Governors of the Federal 
Reserve System, 1919-1937 
(Adjusted for seasonal variation. 1923-1925 average = 100) 


Month 

1919 

1920 

1921 

1922 

1923 

1924 

1925 

1926 

1927 

1928 

Jan. 

82 

95 

67 

73 

99 

100 

105 

106 

107 

107 

Feb. 

79 

95 

66 

76 

100 

102 

104 

105 

108 

109 

March 

76 

93 

64 

80 

103 

100 

103 

106 

110 

108 

April 

78 

88 

64 

77 

106 

95 

102 

107 

108 

108 

May 

78 

90 

66 

81 

106 

89 

102 

106 

109 

108 

June 

83 

91 

65 

85 

106 

85 

102 

108 

107 

108 

July 

87 

89 

65 

85 

104 

84 

103 

108 

106 

109 

Aug. 

89 

89 

67 

83 

103 

89 

103 

110 

106 

110 

Sept. 

87 

86 

68 

88 

100 

94 

101 

111 

104 

113 

Oct. 

86 

83 

71 

93 

99 

95 

104 

111 

102 

115 

Nov. 

85 

76 

71 

97 

98 

97 

107 

110 

101 

117 

Dec. 

86 

72 

70 

100 

97 

101 

109 

107 

102 

118 

Annual 

index 

83 

87 

67 

85 

101 

95 

104 

108 

106 

111 


Month 

1929 

1930 

1931 

1932 

1933 

1934 

1935 

1936 

1937 

Jan. 

119 

106 

83 

72 

65 

78 

90 

97 

114 

Feb. 

118 

107 

86 

69 

63 

81 

89 

94 

116 

March 

118 

103 

87 

67 

59 

84 

88 

93 

118 

April 

121 

104 

88 

63 

66 

85 

86 

101 

118 

May 

122 

102 

87 

60 

78 

86 

85 

101 

118 

June 

125 

98 

83 

59 

91 

83 

87 

104 

114 

July 

124 

93 

82 

58 

100 

76 

86 

108 

114 

Aug. 

121 

90 

78 

60 

91 

73 

88 

108 

117 

Sept. 

121 

90 

76 

66 

84 

71 

91 

109 

111 

Oct. 

118 

88 

73 

67 

76 

73 

95 

110 

102 

Nov. 

110 

86 

73 

65 

72 

74 

96 

114 

88 

Dec. 

103 

84 

74 

66 

75 

86 

101 

121 

84 

Annual 

index 

119 

96 

81 

64 

76 

79 

90 

105 

110 


output of individual commodities. Seasonal indices were 
computed by averaging the ratios of actual data to twelve- 
month moving averages. (See Chapter VIII.) Where there 
was evidence of progressive change in the seasonal pattern, 
the seasonal adjustments for a given commodity were 
modified from year to year. The actual adjustment for 
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seasonal change is made by dividing the daily average 
output of a given commodity in a stated month by the 
seasonal index for that month, expressed as a ratio (i.e., 
as 1 . 10, if the conventional index were 110). The seasonally 
adjusted q would thus be reduced if the seasonal index 
were above 1 . 00, raised if the seasonal index were below 
1 . 00. In the construction of the seasonally corrected 
index, these adjusted q’s are used in the aggregative formula 
previously described. 1 

Monthly values of this index are given in Table 83, for 
the period 1919-1937. The index is shown graphically in 
Fig. 68 on page 317. 

Derived Indices of Production and Productivity 

It is possible, where suitable records of value of product 
and indices of price changes are available, to derive an 
index of production by indirection. In the case of a single 
commodity it is obvious that pq p = q. (Here q repre- 
sents the number of physical units produced, p represents 
average per unit price, and pq is the aggregate value.) 
A similar process is possible in handling statistics relating 
to a number of commodities, in combination. Indeed, the 
records may be in the form of relatives, or index numbers, 
covering a number of months or years. Division of a value 
index by a price index relating to the commodities included 
in the value index will yield an index measuring changes 
in physical output. 

This procedure may sometimes be used to obtain meas- 
urements that could not possibly be built up by combining 
a number of individual records. Whether the method is 
applicable in a given instance depends upon the compara- 
bility of the price and value index numbers. The strict 

1 A detailed description of the constituents of this index and of the pro- 
cedure employed in its construction is given in the Federal Reserve Bulletin 
for February, 1927. Revisions are noted in the issues of that Bulletin for 
March, 1932, Sept., 1933, Nov., 1936, and March, 1937. The index appears 
in current issues of the Federal Reserve Bulletin . 
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requirement that the price index relate to precisely the 
commodities included in the value index cannot generally 
be met. If we assume that a given price index is fairly 
representative of the commodities covered by the value 
records, and if the formula employed in the construction 
of the index is appropriate, the method may be justified 
as a means of approximating the required index of physical 
output. 

An example of such a procedure is furnished by the 
materials in Table 84. These illustrate a method used in 
deriving an index of production of manufactured goods. 
The indices in col. (3) are derived directly from the aggre- 
gate figures on “value added by manufacture.” The indices 
in col. (4) measure changes in average “value added” per 
unit, or cost of fabrication per unit, of manufactured goods. 
(This is, in effect, a price index, the price covering the 
services of manufacturing agents in the process of fabrica- 
tion.) This series of index numbers is based upon records 
available for a representative sample of manufacturing 
industries. The general index of manufacturing production, 

Table 84 

Illustrating the Derivation of Index Numbers of the Physical Volume 
of Manufacturing Production, 1923-1929 1 


(1) 

(2) 

Total 

(3) 

(4) 

Index of 

(5) 

Derived 


value added , 

value added 

index of 

Year 

all census industries 

per unit of 
product , 

physical 

volume 


(in 

(in 

industries 

of 


millions 

relcr 

included in 

production 


of dollars) 

lives) 

sample 

(3) + (4) 

1923 

25,850 

100.0 

100.0 

100.0 

1925 

26,778 

103.6 

97.3 

106.4 

1927 

27,585 

106.7 

92.4 

115.4 

1929 

31,844 

123.2 

96.8 

127.3 


1 This table is taken from Economic Tendencies in the United States , New 
York, National Bureau of Economic Research, 308. 



MEASUREMENTS OF PRODUCTIVITY 321 


relating to all industries, is derived by dividing the relatives 
for total “value added” by the index numbers measuring 
changes in “value added” per unit of product (with a 
suitable shift in the decimal point). 

The derived measurements given in col. (5) of Table 84 
are probably more accurate than index numbers based upon 
directly enumerated physical products. For the gaps in 
the coverage of the latter are serious. Limitations of coverage 
are the more serious in that the excluded industries are 
in many cases just the new, rapidly developing industries 
the output of which is growing most rapidly. 

A somewhat similar process of derivation is employed in 
the construction of measurements of industrial productivity. 
It is impossible, by direct observation, to compile records 
of output per man or per man-hour over any considerable 
area of industrial activity. However, given accurate indices 
of physical production and comparable records of number 
of workers employed or of man-hours worked, one may 
derive index numbers measuring changes in productivity. 

An example of this procedure is given in Table 85. The 
measurements given should be regarded as approximations 
only. 


Table 85 


Index Numbers of Physical Volume of Production, Man-Hours 
Worked and Output per Man-Hour, Manufacturing 
Industries of the United States, 1929-1935 



Physical volume 

Total number 

Estimated 

Year 

of manufacturing 

of man-hours 

output per 


production 

worked 

man-hour 

1929 

100 

100 

100 

1931 

75 

66 

114 

1933 

70 

60 

117 

1935 

87 

70 

124 


Between 1929 and 1935 the total volume of manufacturing 
production declined 13 per cent. The number of man- 
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hours worked decreased by 30 per cent, however. The 
indicated gain in output per man-hour was 24 per cent. 

Measurements such as these are of unquestioned value 
to the student of industrial change, but their limitations 
should be clearly stated. The initial necessity of full com- 
parability between the output and employment records has 
been mentioned. Discrepancies here may lead to serious 
errors in the derived measurements. More difficult to 
detect are technical industrial changes that do not appear 
in the statistical records. Changes in the quality of the 
goods represented in the production index may lessen the 
accuracy of that index, and affect the productivity measure- 
ments. If employment is measured in terms of number of 
men employed, the resulting index of per capita output 
may be seriously distorted by changes in the length of the 
working week. Again, if only direct labor is enumerated 
in the employment index, a shift in technical methods that 
involves the use of a much larger proportion of indirect 
labor may lead to a great advance in apparent productivity, 
which far exceeds the real gain. Some of the gain that 
apparently follows the increased mechanization of a plant 
or a process is of this fictitious sort. Labor that precedes 
the direct act of production, and servicing and supervising 
labor, may have replaced direct labor. Failure to take 
account of the contributions of these indirect applications 
of labor may lead to grossly exaggerated measures of 
productivity gains. 

The purpose of the preceding pages has been to exemplify 
procedures used in the measurement of changes in produc- 
tion, with incidental reference to related problems. While 
there is no one standard method, it will be clear that the 
construction of quantity index numbers requires no involved 
procedure. Certain special problems — of weighting, of 
measuring secular and seasonal movements, of ensuring 
comparability when methods of derivation are employed — 
have been noted. In addition, most of the problems that 
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bulk large in the construction of price index numbers are 
faced in this area also. The task of obtaining accurate, 
homogeneous series of basic data entails no less careful 
field work in production than in prices. Quality changes 
lessen the accuracy of both types of index numbers. Com- 
parisons over considerable time periods are rendered inaccu- 
rate by such quality changes and perhaps even more by 
changes in “regimen” — in the complex of tastes, consuming 
habits, and technical methods that determines the weighting 
factors used in the construction of index numbers. In spite 
of these difficulties substantial progress has been made in 
recent years in the improvement of measures of industrial 
activity of the type discussed in this chapter. More com- 
prehensive and more accurate data are being compiled, 
and technical standards in the construction of index numbers 
are being steadily raised. These gains are contributing to 
a notable advance in our knowledge of economic processes. 
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CHAPTER X 


THE MEASUREMENT OF RELATIONSHIP: LINEAR 
CORRELATION 

In discussing averages and measures of dispersion and 
skewness we have been dealing with methods of describing 
a single frequency distribution. The arrangement of the 
values of a single variable along a scale may be portrayed 
by means of these measures, which enable the central value 
to be determined and the character of the distribution 
about that central value to be described. In the analysis 
of time series a somewhat different problem has been faced. 
In such cases we are concerned with the changing values 
of a variable factor with the passage of time, and seek to 
determine the degree to which the changes in value are 
due to the play of different forces — the secular trend and 
cyclical, seasonal, and accidental factors. The preceding 
chapters dealt with methods by which we might measure 
the effect upon a given series of each of these factors (with 
the exception of accidental fluctuations). 

Certain of these methods are applicable to the problem 
now before us. It was found that in dealing with time 
series the relationship between time and the long term trend 
factor may be described by a definite mathematical equa- 
tion. That is, trend or growth seems to be a function of 
time for many economic series. Where such a relationship 
prevails, whether it hold precisely or only approximately, 
there is a distinct advantage in securing a mathematical 
expression which describes it. A similar but much broader 
problem is now to be discussed. If it is possible in dealing 
with time series to secure a definite mathematical equation 
for the relation between time and the normal values of the 
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items in a given series, cannot the same device be employed 
in studying the relationship between other variables? Can 
we not define, mathematically, the relation between cotton 
production and the price of cotton, between corn yield 
and rainfall, between earnings and the output of labor? 
If this can be done, it will place in the hands of the econo- 
mist a very powerful tool, giving his methods something 
of the precision which attaches to the work of the physical 
scientist. 

The Relation between Number of Taxable Personal 
Incomes and Motor Vehicle Registration 

As a typical problem we may consider the relation between 
the number of taxable personal incomes and the number 
of passenger automobiles registered, by states in 1934. 
The figures are given in columns (2) and (3) of Table 86. 1 

These figures are plotted in Fig. 69, each dot representing 
the relation between the number of taxable incomes and 
the number of registered passenger cars for a given state. 
Such a figure is termed a “scatter diagram.” It is clear 
from this diagram that there is a relationship between the 
two variables. In general, the states with a large number 
of taxable personal incomes are also those having a large 
number of motor vehicle registrations. The relationship, 
however, is not perfect. Two states with the same number 
of taxable incomes may differ quite widely in the number 
of registered vehicles. Thus both Rhode Island and Colorado 

1 Nine states for each of which there were more than 100,000 individual 
income tax returns and more than 685,000 passenger cars registered in 1934 
have not been included. The observations for these states, some of which are 
materially affected by the presence of important industrial centers, depart 
rather widely from those for the remaining states, and are marked by a func- 
tional relationship between personal incomes and motor vehicle ownership 
somewhat different from that prevailing for the country at large. The states 
thus excluded are New York, Pennsylvania, New Jersey, Illinois, Massachu- 
setts, Michigan, Texas, Ohio, and California. The state of Washington has 
also been excluded, since the income tax returns for that state are combined 
with those of Alaska, in the reports of the Bureau of Internal Revenue. The 
results are to be interpreted, of course, with these restrictions in mind. 



Table 86 


Taxable Personal Incomes and Passenger Automobile Registration in 
Thirty-Eight States, 1934 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 


No. of taxable No. of passen- 





personal in- 

ger cars reg- 




State 

comes, 1934 

istered, 1934 





( thousands ) 

( thousands ) 





X 

Y 

XY 

X* 

y* 

Alabama 

23 

192 

4,416 

529 

36,864 

Arizona 

11 

80 

880 

121 

6,400 

Arkansas 

13 

162 

2,106 

169 

26,244 

60,516 

Colorado 

31 

246 

7,626 

961 

Connecticut 

91 

310 

28,210 

8,281 

96,100 

Delaware 

11 

45 

495 

121 

2,025 

Florida 

33 

280 

9,240 

1,089 

78,400 

Georgia 

38 

317 

12,046 

1,444 

100,489 

Idaho 

9 

91 

819 

81 

8,281 

Indiana 

70 

680 

47,600 

4,900 

462,400 

Iowa 

48 

591 

28,368 

2,304 

349,281 

Kansas 

36 

453 

16,308 

1,296 

205,209 

Kentucky 

35 

295 

10,325 

1,225 

87,025 

Louisiana 

37 

199 

7,363 

1,369 

39,601 

Maine 

21 

141 

2,961 

441 

19,881 

Maryland 

84 

288 

24,192 

7,056 

82,944 

Minnesota 

67 

594 

39,798 

4,489 

352,836 

Mississippi 

13 

141 

1,833 

169 

19,881 

Missouri 

98 

632 

61,936 

9,604 

399,424 

Montana 

17 

97 

1,649 

289 

9,409 

Nebraska 

27 

350 

9,450 

729 

122,500 

Nevada 

5 

26 

130 

25 

676 

New Hampshire 

17 

91 

1,547 

289 

8,281 

New Mexico 

8 

67 

536 

64 

4,489 

N. Carolina 

32 

385 

12,320 

1,024 

148,225 

N. Dakota 

10 

130 

1,300 

100 

16,900 

Oklahoma 

39 

403 

15,717 

1,521 

162,409 

Oregon 

27 

233 

6,291 

729 

54,289 

Rhode Island 

31 

124 

3,844 

961 

15,376 

S. Carolina 

15 

182 

2,730 

225 

33,124 

S. Dakota 

8 

146 

1,168 

64 

21,316 

Tennessee 

38 

299 

11,362 

1,444 

89,401 

Utah 

11 

85 

935 

121 

7,225 

Vermont 

10 

69 

690 

100 

4,761 

Virginia 

48 

317 

15,216 

2,304 

100,489 

W. Virginia 

30 

167 

5,010 

900 

27,889 

Wisconsin 

93 

589 

54,777 

8,649 

346,921 

Wyoming 

7 

52 

364 

49 

2,704 

Totals 

1,242 

9,549 

451,558 

65,236 

3,610,185 
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had 31,000 taxable personal incomes in 1921, yet the former 
had 124,000 passenger cars registered, while the latter had 
246,000. Were the relationship perfect a single and unchang- 
ing value of the F-variable would always be found paired 
with a given value of the X-variable. 

Our first problem is the derivation of an equation to 
describe this relationship which, while not perfect, is clearly 



Taxable Personal Incomes In 1934 (Thousands) 


Fig. 69. — Scatter Diagram Showing the Relation between Taxable 
Personal Incomes and Passenger Car Registration, by States, in 1934, 
with Line of Average Relationship 

existent. There is here a relationship ana’ogous to a trend, 
and it io apparently a trend which can be represented by 
a straight line. The equation to a straight line, fitted by 
the method of least squares to the points on the scatter 
diagram, will express mathematically the average relationship 
between these two variables. Such a line could, of course, 
be fitted by inspection, but a more accurate result will be 
obtained by the method of least squares. 

This calls for the solution of the following normal equa- 
tions: 

2(F) = Na + 62(A) 

2 (XF) = o2(X) + 62 (X*). 

The values required for the solution of these equations may 
be derived from the data as arranged in Table 86. Sub- 
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stituting, we have 

9,549 = 38a + 1,2426 
451,558 = 1,242a + 65,2366. 

Solving 

a = 66.321 
6 = 5.659. 

The required equation is 

Y = 66.321 + 5.659X. 1 

This line is plotted in Fig. 69. 

A mathematical expression has now been secured for 
the relation between the two variables being studied, the 
number of taxable personal incomes, by states, and the 
number of passenger automobiles registered. The former 
is the independent or X-variable in the equation, the latter 
the dependent or y- variable. This equation constitutes a 
measure of the functional relationship between these two 
variables, but it is only an expression of average relationship. 
How significant is the equation? If the relationship were 
perfect, and the plotted points all lay on the line describing 
this relationship, the equation could be used with confidence 
as an accurate instrument for determining the value of one 
variable from a value of the other. But a line with a definite 
equation may be fitted to points which depart very widely 
from it, which are widely dispersed. In such a case the 
equation may have the appearance of describing a precise 
relationship but the variation is so great that it cannot be 
used with confidence. It is the same problem as that which 
arises when an average is employed. We must know how 
significant the average is, how great the concentration about 
it, before we may use it intelligently. So the equation of 

1 In the chapters on correlation capital letters (I V, X , Y , etc.) are used to 
represent original values of the variable quantities, as measured from tne zero 
points on the scales of actual values. Capital letters with prime marks are 
used to measure deviations from arbitrary origins, X’ and Y f for such devia- 
tions in class-interval units, X" and Y" for such deviations in original units 
of measurement. Small letters (w, x, y t etc.) are used to represent values of the 
variables expressed as deviations from their respective arithmetic means. 
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relationship between variables means little unless we know 
to what extent it holds in practical experience. We must 
have a measure of the dispersion about the line we have 
fitted. 

In describing the frequency distribution, the standard 
deviation is used as the best general measure of variation. 
It is, obviously, the measure we need in determining the 
reliability of the equation of average relationship. The 
standard deviation about this line will not only serve as a 
general index of the significance of this equation but will 
enable us to measure the degree of accuracy of estimates 
based upon the equation. 

THE COMPUTATION OF THE STANDARD ERROR 
OF ESTIMATE 

The standard deviation about a line of average relation- 
ship, being a measure of the accuracy of estimates, may 
be termed the standard error of estimate. The term standard 
deviation is generally confined to the root-mean-square 
deviation about the arithmetic mean. The standard error 
of estimate is represented by the symbol S. 

In the computation of S we must know the computed value 
of Y which corresponds to each given value of X. By 
substituting the given values of X in the equation 

Y = 66.321 + 5 . 659 A' 

normal Y values may be computed. The deviations of the 
actual Y values from the computed may then be determined. 
The root-mean-square of these deviations is the required 
measure. A method of computation is illustrated in Table 87. 
From this table we have 

„ ,/421,398.31 

**- v - 38 — 

= 105.3 (thousand) motor cars. 

(The symbol S y is used, as this is the standard error of the 
K- variable.) 



Table 87 


Computation of Standard Error of Estimate 


(1) 

(2) 

No. of passenger 
cars registered , 

(3) 

(4) 

d 

(5) 

State 

1934 

(in thousands) 

Y -actual 

Y -computed 

(2) - (3) 

d* 

Alabama 

192 

196.5 

- 4.5 

20.25 

Arizona 

80 

128.6 

- 48.6 

2,361.96 

Arkansas 

162 

139.9 

+ 22.1 

488.41 

Colorado 

246 

241.8 

+ 

4.2 

17.64 

Connecticut 

310 

581.3 

- 271.3 

73,603.69 

Delaware 

45 

128.6 

- 83.6 

6,988.96 

Florida 

280 

253.1 

+ 26.9 

723.61 

Georgia 

317 

281.4 

+ 35.6 

1,267.36 

Idaho 

91 

117.3 

- 26 3 

691 . 69 

Indiana 

680 

462.4 

+ 217 6 

47,349.76 

Iowa 

591 

337.9 

+ 253.1 

64,059.61 

Kansas 

453 

270.0 

+ 183.0 

33,489.00 

Kentucky 

295 

264.4 

+ 30.6 

936.36 

Louisiana 

199 

275.7 

- 76.7 

5,882.89 

Maine 

141 

185 2 

- 44.2 

1,953.64 

Maryland 

288 

541.7 

- 253.7 

64,262.25 

Minnesota 

594 

445.5 

+ 148.5 

22,052.25 

Mississippi 

141 

139.9 

+ 

1.1 

1.21 

Missouri 

632 

620.9 

+ 11.1 

123.21 

Montana 

97 

162.5 

- 65.5 

4,290.25 

Nebraska 

350 

219.1 

+ 130.9 

17,134.81 

Nevada 

26 

94.6 

- 68.6 

4,705.96 

New Hampshire 

91 

162.5 

- 71.5 

5,112.25 

New Mexico 

67 

111.6 

- 44.6 

1,989.16 

N. Carolina 

385 

247.4 

+ 137.6 

18,933.76 

N. Dakota 

130 

122.9 

+ 

7.1 

50.41 

Oklahoma 

403 

287.0 

+ 116.0 

13,456.00 

Oregon 

233 

219.1 

+ 13.9 

193.21 

Rhode Island 

124 

241.8 

- 117.8 

13,876.84 

S. Carolina 

182 

151.2 

+ 30.8 

948.64 

S. Dakota 

146 

111.6 

+ 34.4 

1,183.36 

Tennessee 

299 

281 4 

+ 17.6 

309.76 

Utah 

85 

128.6 

- 43.6 

1,900.96 

Vermont 

69 

122.9 

- 53 9 

2,905.21 

Virginia 

317 

338.0 

- 21.0 

441.00 

W. Virginia 

167 

236.1 

- 69.1 

4,774.81 

Wisconsin 

589 

592.6 

- 3.6 

12.96 

Wyoming 

Total 

52 

105.9 

- 53.9 

2,905.21 
421 .398.31 
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The measure 5 V is to be interpreted in precisely the same 
way as the standard deviation about an arithmetic mean. 
Given an approximately normal distribution of items about 
the line of relationship, 68 per cent of all the cases will lie 
within a range of ± 5 (in this case 105.3), 95 per cent 
will fall within ± 25 (in this case 210.6) and 99.7 will 
fall within ±35 (in this case 315.9). If there were no 
scatter about the line fitted to the points representing the 
corresponding values of X and F, 5 would have a value 
of zero, and the value of Y could be estimated from the 
value of X with perfect accuracy. The less the dispersion 
about the line, the smaller the value of 5. The value of 
5 serves, therefore, as an indicator of the significance and 
usefulness of the line which describes the relationship 
between the two variables. The standard error, it should 
be noted, is expressed in the same units as the original 
F-values. 


THE MAKING OF ESTIMATES 

We may, for a moment, consider the significance of these 
results. Let us assume that, not knowing the number of 
motor vehicles registered in a given state, we are under the 
necessity of estimating it. Two methods are open to ut. 
We may, in the first place, base the estimate upon our 
knowledge of the F- variable alone. The total number of 
passenger automobiles in the 38 states included in the 
study is 9,549,000. Dividing this by 38 we have 251,289 
as the average. With no specific information as to the 
registration in a given state, the arithmetic mean of all 
the state figures would be taken as the most probable value 
for the state in question. (The most probable value of a 
series of observations is the mean of the series.) How may 
we judge of the accuracy of this estimate? The standard 
deviation of the original distribution is a measure of the 
degree of variation about the mean and, therefore, a measure 
of the accuracy of an estimate based upon the mean. If 
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the distribution approximates the normal type, the chances 
are 68 out of 100 that the true value for the state in question 
will not differ from the mean by more than the standard 
deviation. The standard deviation of passenger automobile 
registration by states, as recorded in Table 86, is 178.5. 
The mean affords, therefore, a basis for a reasonable estimate, 
and the standard deviation affords some indication of the 
probabilities involved in making this estimate. 

Another method of estimating the motor vehicle registra- 
tion in a given state is open to us if we know the number 
of taxable personal incomes in that state. We know, as a 
result of the study described in the preceding pages, that 
the average relationship between passenger car registration 
and number of taxable personal incomes is described by the 
equation 

Y = 66.321 + 5.659A. 

(The unit is 1,000 for each variable, it will be recalled.) 

If a state has 50,000 taxable personal incomes, it may be 
estimated from this equation that there are 349,271 passenger 
automobiles registered in that state. This is the most prob- 
able value of Y as determined from the equation of average 
relationship. Is this estimate any better than the previous 
one, which took the mean Y as the most probable value? 
Does our knowledge of the average relationship between 
X and Y aid us in estimating the value of Y from a known 
value of X ? 

The answers to these questions are given by the standard 
error of estimate, and by the relationship between the 
standard error of estimate and the standard deviation of Y. 
The standard error of estimate (that is, the standard devia- 
tion about the line of average relationship) is 105.3. The 
standard deviation of Y is 178.5. Clearly the estimate 
made from the equation is more accurate than the estimate 
based upon the value of the mean Y. In the former case 
the odds are 68 out of 100 that the error will not exceed 
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105.3 or, in terms of the original units, 105,300 vehicles. 
When the estimate is made from the mean, the odds are 
68 out of 100 that the error will not exceed 178,500 vehicles. 1 
From our knowledge of the relationship between the two 
variables, even though that relationship is by no means 
constant or perfect, we are able to reduce materially the 
errors of estimate. 

THE COEFFICIENT OF CORRELATION 

We have now secured two measures which aid us in 
describing the relationship between variable quantities. 
The first is the fundamental equation of relationship, the 
expression of the degree of change in one variable associated, 
on the average, with a given change in the other. The second 
is the standard error of estimate, the measure of the degree 
of “scatter” about the line of average relationship. The 
standard error resembles the standard deviation in that 
it is a measure expressed in absolute terms, in the units 
employed in measuring the original F-values. This measure 
enables us to determine in a given case the probability that 
an estimate based upon the equation of relationship will 
fall within certain limits. 

In measuring variation it has been found that an abstract 
measure of variability is needed, one which is divorced 
from the absolute terms of the given problem. Such a 
measure is particularly needed, it was noted, when different 
distributions are to be compared. So, for measuring the 
degree of variability, a coefficient of variation is employed. 
There is need of a somewhat similar measure in connection 
with our present problem. We need a measure of the 
degree of relationship between two variables, an abstract 
coefficient which is divorced from the particular units 

1 In the present case, with a limited number of items and distributions which 
depart somewhat from the normal type, the precise probabilities cannot be so 
accurately determined from the values of Sy and With this qualification 
in the matter of interpretation we may use S v and <r v as useful measures of 
dispersion. 
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employed in a given case. Karl Pearson has developed such 
a coefficient. 

This measure may be explained in terms of the preceding 
discussion. It was found that the usefulness of estimates 
based upon the equation of relationship could be determined 
by comparing the standard error of Y (the measure of 
scatter about the line of relationship) with the standard 
deviation of Y. If the standard error be as great as the 
standard deviation the equation of relationship is of no 
use to us, but if the standard error be less than the standard 
deviation the accuracy of estimates may be improved by 
using this equation. The significance of the equation is thus 
indicated by the relation between the standard error and 
the standard deviation. But these are both in absolute 
terms, so that by dividing one by the other an abstract 
measure may be secured. Thus we might write 


Measure of correlation 


S,. 

<r y 


A somewhat more useful measure is secured by putting the 
ratio in this form: 


Measure of correlation 



This measure, when used in connection with a linear equa- 
tion, is called the coefficient of correlation and is represented 
by the symbol r. 

A brief consideration of this formula will help to make 
clear the significance of r. If there is no dispersion about 
the line of relationship, S v will have a value of zero; the 
equation describes a perfect relationship between the two 
variables. In this case, as is clear from the formula, r must 
have a value of 1. 

The max i mum value of S„ is one which is equal to <r„. 
Under these conditions, when the equation of relationship 
is of no aid in improving our estimates, the formula will 
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give zero as the value of r. Such a value indicates that 
there is no relationship between the two variables; in other 
words, that the straight line of best fit is horizontal, passing 
through the mean of the T’s. It shows that there is no 
tendency for the high values of Y to be associated with 
high values of X or for high values of Y to be associated 
with low values of X. The two variables fluctuate in absolute 
independence. In such a case the deviation of each point 
from the fitted line is equal to its deviation from the mean, 
and the two root-mean-square deviations are equal, as 
stated. 

Zero and unity are thus the limits to the value of r. 
The values found in practical work fall somewhere between 
these limits, approaching unity in cases where the degree 
of relationship is high. The greater the value of r, the 
greater the confidence that may be placed in the equation 
as an expression of a relation which is approximated in a 
high percentage of cases. In the example presented above, 
dealing with motor vehicle registration and number of 
taxable personal incomes, we have 

r= J\ ( 105 . 3 )* 

V 1 ( 178 . 5) 2 

= . 81 . 

This value indicates a definite and fairly close connection 
between these two variables for the states included in the 
sample. 

The coefficient of correlation may be made somewhat 
more significant by giving it the sign of the constant b in 
the equation of relationship. This sign indicates whether 
the slope of the line is positive or negative and, when 
attached to r, enables us to tell whether the relationship 
is direct or inverse. Thus in the present case high values 
of one variable are paired with high values of the other. 
The correlation is positive and the coefficient should be 
written + .81. When cotton production and prices are 
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correlated the relationship is an inverse one, for high values 
of one variable are generally associated with low values of 
the other. 

The measurement of relationship in a given case is com- 
pleted when we have secured the three measures described. 
The equation of average relationship is an expression of 
the underlying law connecting the two variables, if such a 
law may be assumed. The standard error of estimate meas- 
ures the variation, in absolute terms, about the line of rela- 
tionship. The coefficient of correlation is an abstract measure 
of the degree to which the average relationship actually holds 
in practice. 


DETAILS OF CALCULATION 

In the preceding section the attempt has been made to 
explain the various measures necessary in studying the 
relationship between variable quantities without introducing 
a detailed explanation of procedure. We may now return 
to a consideration of the details of calculation, including 
certain methods by which this calculation may be reduced 
to a minimum. 

The procedure followed in the preceding illustration is a 
logical one to employ in deriving the three required values. 
This method is capable of general application, but the labor 
involved may be materially reduced by taking advantage 
of a short-cut method of deriving S y . This method may be 
first explained with reference to data of the type dealt 
with above. And, for the present, the discussion will be 
confined to cases in which the relationship between variables 
may be described by a straight line. 

The first problem is the derivation of the equation of 
relationship. A line of the type 

Y = a + bX 

is fitted by the method of least squares. 

The next step is the computation of $„ 2 , the square of the 
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standard error of estimate. This was done in the above illus- 
tration by measuring the deviation of each individual obser- 
vation from the fitted line, and getting the mean-square of 
these deviations. It may be shown 1 that this value can be 
derived from the following equation: 

a , 2(7*) - a2(7) - 62 (X 7) 

S * N 

The quantities a and b are the constants in the equation to 
the fitted straight line. The other values relate to the 
original observations. Substituting in this equation a and b 
and the other necessary values, taken from Table 86, we have 2 

1 The standard error of estimate is computed from the formula 

Sv '-~ir 

where d represents a single deviation from the fitted line, or the difference 
between the actual and the computed value of F in a given case. The latter 
is derived from the equation 

Y c = a -f bX. 

(The symbol Y c is used to represent the computed value of Y.) 

If we let Y represent the actual value, we have, for each residual, 

d - Y c - Y 
or 

d = a -f bX - F. (1) 

There will be as many equations of this type as there are points. Multiply- 
ing each one by d, and adding, we have 

2(d 2 ) = o2(d) + b2(dX) - 2(dF). (2) 

But, since the line was fitted by the method of least squares, 

2(d) = 0 
2 (dX) = 0 

(for proof of this see Appendix A) 
and, therefore, 

2(d 2 ) = - 2(dF). (3) 

Returning again to equation (1), we may multiply throughout by Y , and 
add, securing 

2(dF) = a2(F) + 62(XF) - 2 (F 2 ). (4) 

Substituting the equivalent of 2(dF) in equation (3), we have 

2(d 2 ) = 2(F 2 ) - a2(F) - bX(XY) (5) 

from which the given formula for S y 2 is derived. 

2 For this calculation the values of a and b are given to a greater number of 
decimal places than in the equation as first presented. 
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&*- 


3,610,185 - (66.32136 X 9,549) - (5.65925 X 451,558) 


= 11,090 
■Sy = 105.3. 

From this point the procedure may follow that already 
described, r being computed from the formula 




The coefficient r may be secured, however, without com- 
puting S as an intermediate value. The above formula for 
r may be reduced to 


2 _ ogOO + bZ(XY) - Ncy 2 


2 (7 2 ) - Nc y 2 

where Cy is the difference between the mean Y and the 
origin employed in the calculations. 1 If the origin is zero 


1 The formula 


may be written 


r 2 = 1 — 


in which y refers to deviations from the arithmetic mean of the Y 1 s. But 

2W (JY" 


where Y represents a deviation from an arbitrary origin (in this case zero on the 
original scale) and c v represents the difference between this origin and the 
mean of the F’s. 

Therefore 

r> - 1 

2(V») - NCy 1 

Substituting in this equation the equivalent of 2(d a ), as given in the footnote 
on page 338, 

r2 S(F 2 ) -qS(i ) - 6S(XF) 

2(F 2 ) - NCy 2 

Simplifying, 

, _ aX(Y) + &2(XF) - NcS 
r “ S(F 2 ) - Ncy 2 
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on the original Y scale, c y will be equal to the arithmetic 
mean of the F’s. 

In the present case, using the data of Table 86, we have 


Cy 


9,549 

38 


251.289. 


The other values are the same as those employed above in 
computing S y . Substituting in the formula, we have 

2 = 789,228-14 
f 1,210,630.86 
= .6519 
r = .81. 


In effect, then, the labor of fitting a straight line by the 
method of least squares gives us practically all the values 
needed in securing S and r, the two other measures necessary 
for a complete description of the relationship between two 
variable quantities. The only additional values required 
are 2(F 2 ) and c y . 


The Construction of a Correlation Table 

In the example presented above we had only thirty-eight 
observations. With a larger number it becomes practically 
impossible to retain the individual values in the study of 
relationships. These individual items must be grouped in 
significant classes, and all computations must be based 
upon these grouped data. This means, merely, that we must 
handle data organized in frequency distributions. Since 
we are dealing with two variables, however, the simple 
frequency table must be modified to meet the needs of 
the present problem. Such a modified frequency table, 
arranged to facilitate the computation of the values needed 
in studying relationship, is termed a correlation table. 

As a typical problem, involving the construction of such 
a table, we may consider the relation between the discount 
rates of Federal Reserve banks and the corresponding dis- 
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count rates of commercial banks. Since the paper discounted 
by commercial banks may be rediscounted at the Federal 
Reserve banks by the member banks, some degree of 
relationship between the rates may be expected. Our present 
object is the measurement of that relationship. 

The first step is the tabulation of the original observa- 
tions. Monthly values of each variable 1 were secured for 
each of the twelve Federal Reserve cities over a period of 
150 months, from July, 1920, to December, 1932.' In the 
process of tabulation the items must be combined so that 
a Federal Reserve bank discount rate is paired with the 
corresponding rate charged by the commercial banks of the 
same city. Fig. 70 illustrates the method of tabulation. 

Tabulation having been completed, a correlation table 
designed to facilitate the later computations may be con- 
structed. Table 88 illustrates a suitable form. 

In Table 88, it will be noted, an arbitrary origin is em- 
ployed for each variable, and the class-interval unit is used 
in the calculations. We here employ the symbols X' and 
Y' to represent deviations from the arbitrary origin (which 
is located at point 1 .50, 3 .50 on the original scales). 

COMPUTATION OF MEASURES OF RELATIONSHIP 

From this correlation table all the values needed in 
fitting a straight line to the data, and in computing the 
measures S and r, may be derived. The quantities 2(X / )> 
S(X /2 ), 2(F'), and 2(F' 2 ) are computed by methods already 
familiar to the student. The product of the paired values 
2 {X'Y') may be computed directly from the correlation 
table, but it is perhaps simpler for the beginner to re-arrange 
the data in columnar form, as in Table 89 on page 345. 
When the figures are disposed in this way one line is em- 

1 The discount rates of the Federal Reserve banks relate, for the first part 
of the period covered, to trade acceptances; for later years they are u rates 
for member banks on eligible paper.” The commercial bank rates are those 
charged on customers* prime commercial paper. The customary rate over 
a given 30 day period was taken as of the middle of that period. 
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Table 89 


Discount Rates of Federal Reserve Banks and Discount Rates of 
Commercial Banks 


(Computation of values required in curve fitting) 


X' Y' 


0 

1 

2 

4 

0 

1 

2 

3 

4 
2 

3 

4 

5 

6 
2 

3 

4 

5 

6 
7 

3 

4 

5 

6 
7 

4 

5 

6 

7 

8 
9 
6 

7 

8 
9 

10 

11 

7 

8 


0 

0 

0 

0 

1 

1 

1 

1 

1 

2 

2 

2 

2 

2 

3 

3 

3 

3 

3 

3 

4 
4 
4 
4 

4 

5 
5 
5 
5 
5 

5 

6 
6 
6 
6 
6 
6 
7 
7 


/ 

fX'Y' 

4 

0 

2 

0 

1 

0 

9 

0 

1 

0 

9 

9 

19 

38 

19 

57 

29 

116 

16 

64 

27 

162 

122 

976 

96 

960 

3 

36 

4 

24 

25 

225 

111 

1,332 

196 

2,940 

65 

1,170 

1 

21 

1 

12 

90 

1,440 

164 

3,280 

175 

4,200 

45 

1,260 

180 

9 

21 

525 

146 

4,380 

150 

5,250 

8 

320 

32 

1,440 

4 

144 

22 

924 

10 

480 

22 

1,188 

1 

60 

3 

198 

5 

245 

4 

224 
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Table 89 — Continued 

Discount Rates of Federal Reserve Banks a d Discount Rates of 
Commercial Batiks 


X 7 

Y' 

/ 

JX'Y' 

9 

7 

63 

3,969 

10 

7 

9 

630 

11 

7 

36 

2,772 

9 

8 

7 

504 

10 

8 

9 

720 

11 

8 

1 

88 

9 

9 

1 

81 

10 

9 

1 

90 

11 

9 

2 

198 

42,932 


ployed for each compartment of the original correlation 
table in which items have been recorded. 

The values required in fitting a straight line and in 
computing the standard error and the coefficient of correla- 
tion are: 

N = 1,800 2(X' 2 ) = 62,354 

Z(X') = 10,054 S(fX'Y') = 42,932 

2(7') = 6,904 2(7' 5 ) = 30,878. 

The equation to the best fitting straight line is found to 
be 

Y' = - .10277 + . 70509X'. 

Substituting in the formula 

S. jj 

we have 

„ . 30,878 - (- .10277 X 6,904) - (.70509 X 42,932) 

* I5oo 

= .7314 
S v = .855. 

To determine the value of the coefficient of correlation 



COMPUTATIONS 347 

we have only to substitute the proper values in the equation 

_ qS(F') + KS(X'F') - NcS 
2(F' 2 ) - Ncy 2 

When this is done we have 

2 _ (- . 10277 X 6,904) + (.70509 X 42,932) - (1,800 X 14.71149) 
T 30,878 - (1,800 X 14 .71149) 

= 3,080.7178 
4,397.3180 

= .70059 
r = + .837. 

All these calculations have been carried through in class- 
interval units, with reference to an origin at point 1 . 50, 3 . 50 
on the original scales. The value of r is not affected by this 
fact, but the estimating equation and the standard error 
of estimate should be corrected. 

The value of S v , in class-interval units, is .855. Since 
the class-interval of the F-variable is .50, we have, in 
original units, 

S v = .50 X .855 
= .4275. 

The equation may be corrected in a similar fashion. 
The class-interval being .50 both for X and F, each unit 
on the original scale equals two class-interval units. Thus 
a range of 4 points on the original scale is equivalent to 
a range of 8 points on the class-interval scale. For conven- 
ience we may use Y" and X" to define deviations in original 
units (i.e., deviations from the arbitrary origin), where we 
have used Y' and X' to define corresponding deviations in 
class-interval units. Then, for any stated deviation, 
2F" = F'; similarly 2X" = X'. Retaining the values of 
a and b in the equation of average relationship, and sub- 
stituting 2 Y" for Y' and 2X" for X' , we have 

2F" = - .10277 + . 70509 (2X"). 
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Simplifying this, we have 

Y" = - .05138 + .70509X" 

which is the equation in terms of original units. 

This equation refers to an origin whose coordinates 
are 1.50 and 3.50 on the original scales. That is, 
Y" = Y - 3.50, and X" - X - 1.50, where Y and X 
define deviations, in original units, from the zero points 
on the original scales. Making these substitutions we have 

Y - 3.50 = - .05138 + .70509(X - 1.50). 

Simplifying, and rounding off the constants by dropping 
figures that are not significant, we have 

Y = 2.391 + . 705X. 

We have now the three values required for determining 
the relationship between Federal Reserve discount rates 



1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75 6.25 6.75 


X-Federal Reserve Bank Rate-Percent 

Fig. 71. — Scatter Diagram of Federal Reserve and Commercial Bank 
Rates, with Line of Average Relationship and Zones of Estimate 

and corresponding commercial bank rates, during the period 
covered. The equation describes the average relationship, 
the standard error of estimate serves as a measure of the 
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reliability of estimates based upon this equation, and the 
coefficient of correlation serves as an abstract measure ol 
the degree of relationship between the two variables. 

The significance of the standard error, S v , is brought out 
graphically in Fig. 71. The line of average relationship 
has been drawn on this scatter diagram, and what may be 
called “zones of estimate” have been marked out about 
this line. Within the zone having a width equal to 2<S, 
centering at the fitted straight line, 68 per cent of all the 
points should fall, on the assumption that the distribution 
is normal. Within the zone having a width equal to 6 S, 
centering at the fitted straight line, 99.7 per cent of all 
the points should fall, on the same assumption. The smaller 
the value of S the narrower these zones would be, and hence 
the more accurate would be the estimates which are based 
upon the equation of average relationship. 

The Product-Moment Formula for the Coefficient 
of Correlation 

In the preceding examples the coefficient of correlation 
has been computed from the formula 

2 = ggOO + 6S(X7) - Nc u * 

T 2(7*) - NcJ 

which is based upon relations involved in fitting a straight 
line by least squares. The usual formula differs somewhat 
from this, and it is advisable that the student be familiar 
with it. 

When a straight line is fitted to data, the origin being 
at the point of averages, the two normal equations 

2(7) = Na + 62 (X) 

2(X7) = o2(X) + 62 (X 2 ) 

become 

2 (y) = Na + 62 (x) 

2 (xy) = a2(x) + 62 (x 2 ) 
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where y and x measure deviations from the point of averages. 
The first of these equations disappears and the second 
reduces to 


for 


2 (xy) = 62(* 2 ) 


2(z) = 0 and 2 (y) = 0. 


>The slope, b, is the only constant required, and this may be 
computed from the relationship 


b = 


2 fry) 
2 (*’)' 


Under the same conditions the formula 


reduces to 


, 2 a2(F) + bZjXY) - NcJ 

2(F 2 ) - Nc y 2 


r , , bj (xy) 

2 ( 2 /*) 

for c v = 0 when the deviations are measured from the mean 
of the F’s. Substituting for b its equivalent, as just deter- 
mined, we have 

_ 2 (xy) • 2 (xy) 

2 (y 2 ) • 2(**) 


But 2 (y 2 ) = N<r v 2 and 2(x 2 ) = Na 2 . 
Therefore 


and 


= 2 (xy) • 2 (xy ) 
N 2 a v 2 <j 2 


_ 2(xy) 

Na^y 

in which x and y refer to deviations from an origin at the 
point of averages 
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This formula may be expressed 

r = -JL 

(TzCTy 

in which 

y N 

The quantity p is the mean product of the paired values of 
x and y. 

The computation of the coefficient of correlation from 
this formula proceeds along lines somewhat different from 
those outlined above. As we have seen, both the arithmetic 
mean and the standard deviation may be readily computed 
by the selection of an arbitrary origin from which all 
deviations are measured, a later correction being made to 
offset the error involved in using this arbitrary origin. 
Similarly, the mean product p may be computed by a short 
method, requiring the use of assumed means and the applica- 
tion of a correction at the end of the process. 

If x' and y' represent deviations from points arbitrarily 
selected as assumed means, while p' represents the mean 
product of such deviations, then 

j - 3(*V) 
p N 

The computation of p' is not difficult, for deviations may 
be measured from central points, and may be expressed in 
class-interval units. Having p' we may secure the true 
mean product from the formula 

p = p' — c*c„ 

in which c x and c v represent the differences between the 
true and assumed means of the x’s and y’s, respectively. 1 

1 The following is a proof of this relationship: 

x' = deviation of any point from assumed mean of x’s 
x =* deviation of same point from true mean of x’s 
c x = difference between true and assumed means of x’s 
y' =* deviation of same point from assumed mean of y’s 

( Footnote 1 continued on page 352) 
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The Product-Moment Method, Ungrouped Data 

This method may be illustrated with reference, first, to 
ungrouped data, using the figures for personal incomes ( X ) 
and passenger car registration (F), by states. The values 
required for this computation, as given in Table 86, are 


N = 38 
2(X) = 1,242 
2(F) = 9,549 
2(A 2 ) = 65,236 
2(F 2 ) = 3,610,185 
2 (XY) = 451,558. 


The mean product may be computed from the formula 


2Qn/) _ 2(xV) 
N N 


CxCy. 


We may select as arbitrary origin the actual origin on the 
two original scales. Hence we have 


V = 


2 (XY) 

N 


CzCy 


(When the arbitrary origin is at zero on the original scales, the 
symbol X corresponds to x' and Y corresponds to y as used in 
the formulas.) 


For the two standard deviations 


(Footnote 1 continued from page 351) 

y = deviation of same point from true mean of y’s 
Cy = difference between true and assumed means of y’s 

x 1 = X + c x 
v' = y + c v 

x'y' = (x + C-)(l / + Cy) = xy + C x y + CyX + CxCy. 

For the sum of all such products for N points, we have 

2 (x'y') = Z(xy) + c»2(y) + Cy Z(x) +NcxCy. 

But 2(y) =■ 0 and 2(x) = 0. 

Therefore 2(x'y') = 2 (xy) + NcxCy 

2 (x'y') 2(xy) , . _ 

n ir +CxCv 
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These measures may be computed readily from the values 
secured from Table 86 : 


c x = = 32.684 


C = = 251.289 


38 38 

c, s = 1,068.2439 ^ = 63,146.1615 

451,558 


V = 


38 

= 3,670.0753 


- 8,213.1297 


j / 65,236 , nfia o/ion 

Ox = y — 5 ^ 1,068.2439 o t 


38 


u — y 


3,610,185 


38 


- 63,146.1615 


= 25.47 


= 178.49 


Solving for the coefficient of correlation, 


p _ 3,670.0753 

OxO •„ 25.47 X 178.49 


+ .8073. 


The equation to the straight line which describes the 
average relationship between X and Y may be derived 
from the values required for the preceding calculations. 
When the origin is at the point of averages this equation 
may be written 

y — r — x. 

Ox 


Substituting the proper values, we have 


y = + 8073 


178.49 

X 

25.47 


= 5.657x. 


This, with an insignificant difference, is the equation secured 
by the method of least squares. The constant term repre- 
senting the ^/-intercept disappears, since the origin is at 
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the point of averages, through which the least squares line 
must pass. 1 

When the product-moment method is employed in com- 
puting the coefficient of correlation and in determining the 
equation of regression, the standard error, S y , may be 
derived by a simple change in the formula first presented 
for r. From the expression 



we may secure the formula 

Sy = (TyVl — r 2 

which enables us to compute S V) if we have the values of 
<r v and r. In the present case, 

S v = 178.49\/1 - .8073 
= 105.3. 


The Product-Moment Method, Classified Data 

The product-moment method is also applicable to cases 
in which it is necessary to construct a double frequency or 


1 That the formula y = r— x is equivalent to the formula based upon the 
ax 

method of least squares may be readily demonstrated. When the line passes 
through the point of averages, the equation, Y = a + bX, becomes y = bx. 

But b = We may write, accordingly, y e - ^ V l x - 

2 (a: 2 ) ) 


This is equivalent to 

for the latter may be written 
( 1 ) Vc 


Vc 


Z(ay) <ry x 


( 2 ) Ve 


iWya a 

. s(sy) 


<Tx 


try 

■ r~x 
<r x 


( 3 ) Vc 


( 4 ) Vc 


2(xy) 


2(x2/) x 


Na x -a x ~ 2(* 2 ) 

(The symbol y c is employed for the computed value of y } in these equations, 
to avoid confusion with the actual y ' s which appear in the right-hand members 
of the equations.) 
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correlation table. The procedure is shown in detail in 
Table 90. 

This table is identical with that previously presented for 
the same data, except that a different arbitrary origin has 
been selected. 

The value 4.50 is adopted as the assumed mean of the 
X’s (M' x ), and the value 5.50 as the assumed mean of 
the F’s (M'y). Deviations are measured in class-interval 
units from this origin. In each compartment of the correla- 
tion table there are three figures, involved in the computation 
of 2 (x'y 1 ). The figure in the center indicates the number 
of items falling in that compartment. Thus there are 
seven pairs having X values between 5.75 and 6.25 (mid- 
point 6 . 0) and F values between 7 . 25 and 7 . 75 (mid-point 
7.5). For each of these pairs x' (the deviation from the 
assumed mean of the X’s) is + 3, in class-interval units, 
and y' (the deviation from the assumed mean of the F’s) 
is + 4, in class-interval units. For each pair, therefore, 
x'y' = + 12. This figure appears at the top of the compart- 
ment. But there are seven pairs in this compartment, so 
the sum of x'y' for this group is + 84. This figure appears 
in parentheses at the bottom of the compartment. To 
secure 2 (x'y') for the entire table it is necessary to add 
algebraically the values secured in this way for all com- 
partments. The addition is first carried out for the different 
rows, the sub-totals being given in the column at the right 
of the table. It is found that 2 (x'y') = + 4,492, in class- 
interval units. 

Details of the computation of the coefficient of correlation 
are given in Table 91 on page 358. The values of c* and c, 
are obtained by methods already familiar. 

We have, from that table, 

2 (xy) __ 2 (x'y') 

~N N~ ** 

= + 2.4277. 




856 
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This is the value of p ) the mean product, in class-interval 
units. Proceeding, 

r = S(xy) = p 

NfJxVy (JjPy 

= + .837. 

In computing r, both the numerator and denominator of 
the final fraction (the mean product and the two standard 
deviations) are in class-interval units. Since this is true, 
r may be computed directly without reducing the figures 
to the original units. The entire operation, therefore, is 
carried on in simple class-interval units. 

Table 91 


Calculation of the Coefficient of Correlation between the Discount Rales 
of Commercial Banks and of Federal Reserve Banks 
(Calculations based on the entries in Table 90) 


M'x m 4.60 

. - 746 
CX 5=8 1,800 

c,* = (- .414)* 
6,50 6 
1,800 
* Sx * - Cx* 

» 3.614 - .171 


&* 


- .414 
.171 
3.614 


- 3.443 
<r, « 1.856 

Mz *4.60- .5(.414) 
*4.293 


M\ 

Cy 


5.50 
- 29 6 
1,800 

« (- .164)* 


- .164 
.027 


— Sy* — Cy* 


* 2.470 - .027 
= 2.443 
<Ty = 1.663 

M v = 5.50 - .5(. 164) 
- 5.418 


2 W) 


- C*Cy 


N 

. 4,492 
* 1,800 

* 2.4956 - .0679 

* +2.4277 


[(-.414K-. 164)1 


VxV V 

+2.4277 

(1.855) (1.663) 
+ 2.4277 
2.8994 


r * + .837 


Note: The class-interval unit has been employed in all the computations 
shown in this table. 

In deriving the equation to the straight line which 
describes the average relationship between x and y from 
the formula 



<r y and <j x should be expressed in units of the original scales . 1 

1 When the class-intervals happen to be the same, as in the present case, 
the change is not necessary, as the relation between numerator and denomi- 
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This is done by multiplying the present values by the 
class-intervals. 


<r* (in original units) = 1.855 X .50 = .9275 
o v (in original units) = 1 . 563 X . 50 = . 7815. 

Substituting the given values in the formula, we have 


Co, -7815 

V — ■ 837 x 

y .9275 

= .705a;. 


The Lines of Regression 

In the above discussion certain terms ordinarily employed 
in the treatment of correlation have been purposely omitted. 
Several of these should be explained. 

The equation to the line of best fit in the preceding 
illustration was found to be 

y — ,705a; 

when the origin was taken at the point of averages. In 
this equation y is expressed as a function of x\ that is, x is 
taken to be the independent variable and y the dependent 
variable. The equation expresses the average variation in 
y (discount rates of commercial banks) corresponding to a 
change of one unit in x (discount rates of Federal Reserve 
Banks). This line of relationship corresponds precisely to 
a line of trend, which describes the average change in a 
given series accompanying a unit change in time. A line 
which thus describes the average relationship between two 
variables is termed a line of regression. Its equation is 

termed a regression equation, and the quantity r — which 

Ox 

gives the slope of such a line is called a coefficient of regression. 
The use of these terms dates back to early studies by 
Galton, dealing with the relation between the heights of 


nator is not altered. In practice it is advisable always to express the two 
standard deviations in original units at this stage of the calculations. 
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fathers and the heights of sons. Sons, Galton found, deviated 
less on the average from the mean heights of the race than 
their fathers. Whether the fathers were above or below 
the average, the sons tended to go back or regress towards 
the mean. He therefore termed the line which graphically 
described the average relationship between these two vari- 
ables the line of regression. The term is now used generally, 
as indicated above, though the original meaning has no 
significance in most of its applications. 

In any given case equations to two lines of regression 
may be computed. One is an expression of the average 
relationship between a dependent Y - variable and an inde- 
pendent X- variable; the other describes the relationship 
between a dependent X-variable and an independent 
F- variable. The significance of the two may be indicated 
graphically. 

Figure 72 is derived directly from the scatter diagram 
presented in Fig. 71. The circle in each column represents 
the mean F-value of all the items falling in that column. 
Thus in the third column there are 40 cases including all 
those with X - values falling between 2.25 per cent and 
2 . 75 per cent. The F-values vary, however, being distributed 
as shown in Table 92. 


Table 92 


Computation of the Arithmetic Mean of an Array 


Class-interval 

Mid-point 

TO 

Frequency 

f 

/to 

4.75 

- 5.24 

5.0 

4 

20.0 

4.25 

- 4.74 

4.5 

16 

72.0 

3.75 

- 4.24 

4.0 

19 

76.0 

3.25 

- 3.74 

3.5 

1 

3.5 




40 

171.5 


M = = 4.2875. 

40 

Similar mean values are obtained for the other columns. 
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These are plotted in Fig. 72, together with the line of 
regression of Y on X. 

In Fig. 72 the X-variable (Federal Reserve bank discount 
rates) is independent. As it increases from 4.0 per cent 
to 4.5, 5.0, 5.5 per cent, and so on, the average of com- 
mercial bank rates increases also. An average commercial 



1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75 6.25 6.75 7.25 
Means of (3.60K3.90) (4.28H4.56K4.86) (5.1 1) (5.60X 5.96X6.40) (6.69)<7.25)<7.02) 
Columns Federal Reserve Bank Rates— Percent 


Fio. 72. — Showing the Relation between Discount Rates of Commercial 
Banks and Federal Reserve Bank Discount Rates. (The broken line 
connects the means of the columns and the straight line shows the average 
change in commercial bank rates corresponding to a unit change in Federal 
Reserve bank rates; i.e., it represents the regression of Y on X) 

bank rate of 4.29 per cent was associated with an average 
Federal Reserve bank rate of 2.5 per cent; an average 
commercial bank rate of 4.56 per cent was associated with 
an average Federal Reserve bank rate of 3.0 per cent, 
and so on. (The commercial bank rates cited are the means 
of the entries in the columns referred to.) The slope of 
the straight line, which is the line of regression or the line 
of average relationship, measures the average increase in 
commercial bank rates corresponding to a unit increase in 
Federal Reserve bank rates. 

It is possible to view the relationship between these two 
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variables in another light. These questions arise: Given 
a certain commercial bank discount rate, what is the average 
Federal Reserve bank rate associated with it? And for a 
given change in commercial bank discount rates, what is 
the average change in the corresponding Federal Reserve 
bank rates? The commercial bank rate is now looked upon 
as independent, and the Federal Reserve rate as an associ- 
ated dependent variable. These questions are answered by 
Fig. 73. The points marked by the small circles and con- 



Federal Reserve Bank Rates — Percent 


Fig. 73. — Showing the Relation between Federal Reserve Bank Discount 
Rates and the Discount Rates of Commercial Banks. (The broken line 
connects the means of the rows and the straight line shows the average 
change in Federal Reserve bank rates corresponding to a unit change in 
commercial bank rates; i.e., it represents the regression of X on Y) 

nected by the broken line show the locations of the arithmetic 
means of the items falling in the various rows. Thus the 
16 X-items in the bottom row have an average value of 
2.75 per cent. This is the average Federal Reserve bank 
discount rate associated with a commercial bank rate of 
3.5 per cent. The average Federal Reserve bank rate 
associated with a commercial bank rate of 4.0 per cent is 
2.93 per cent, and so on. The straight line fitted to these 
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points indicates the relationship between the two, its slope 
measuring the average increase (or decrease) in Federal 
Reserve bank rates associated with a unit change in com- 
mercial bank rates. 

This is the line of regression of X on Y. The general 
formula for the equation to this line is: 

(X x 

X = r — y. 
a v 

Substituting the present values, we have 

-9275 

x ~ 837 .7815 y 


or 

x = .993^. 

The factors in this equation, it will be seen, are the same as 
those entering into the formula for the line of regression of 
y on x 1 . If r is equal to 1 the two lines coincide, and if, 
in addition, the two standard deviations are equal, the line 
of regression will bisect the angle formed by the axes. 
If the points be plotted on a chart scaled in units of the 
standard deviations, we have y = rx; the slope of the line 
of regression is then equal to the value of r. 

The coefficient of regression is represented by the sym- 
bol b. In a simple correlation problem there are two such 
coefficients, representing the slopes of the two lines of 


1 The formula x — r - \j 

<Ty 

2 ( X\J ) 

may be reduced to x = y- 

This is the equation to a line fitted to the points plotted in Fig. 73 in such a 
way that the sum of the squares of the horizontal deviations is a minimum. 
The formula 


y = 


S(sj/) 

2{x*) X 


is the equation to the line for which the sum of the squares of the vertical 
deviations is a minimum. An understanding of this point may make clear 
the difference between the two lines of regression. 
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regression. These are 



<*v 


(The subscripts indicate the relation between the two varia- 
bles. The first subscript refers to the dependent variable in each 
case.) 

The coefficient r appears in both formulas. This being 
so, it is clear that r may be computed from the regression 
coefficients. For 



Thus if we know the slopes of the two lines of regression 
r may be determined. In the present example 

r = V.705 X .993 = .837 


USE OF THE EQUATIONS OF KEGRESSION 


The two equations of regression given above 


and 


y — .705a; 
x = .993y 


describe relations between deviations from the respective 
arithmetic means. That is, the origin is at the point of 
averages, and to use the equations we cannot use the 
original values of X and Y but must express them as devia- 
tions from their means. For example, we wish to determine 
the normal commercial bank rate associated with a Federal 
Reserve bank rate of 6 per cent. The mean value of the X- 
variable (Federal Reserve bank rates) is 4 . 293 per cent. A 
rate of 6 per cent represents a deviation from the mean 
of + 1 . 707. Substituting this value in the first of the 
above equations, we have 
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y = .705 X (+ 1.707) 

= + 1.203. 

This is the average y-deviation associated with an ^-deviation 
of + 1.707. To get the normal commercial bank rate 
associated with a Federal Reserve rate of 6 per cent the 
quantity + 1 . 203 per cent must be added to the mean 
commercial bank rate, 5 . 418 per cent. The value we wish 
is thus 6.621 per cent. 

This calculation has been rather round-about because 
of the form of the equation of relationship. This equation 
can be put in more appropriate form for such computations 
Let 

X = arithmetic mean of the X’s 
Y = arithmetic mean of the F’s. 

Then 


may be written 


y - r 


<Ty 

O’* 


x 


Y — Y — r— (X — X). 

<Tx 


In this last equation X and Y represent the values of the 
variables on the original scales, and not as deviations from 
their respective means. In terms of the coordinate chart, it 
means shifting the origin from the point of averages to a 
point corresponding to zero on each of the original scales. 

To illustrate the greater utility of the equation in this 
form, the equation 

y = .705x 

may be changed in the manner indicated. It becomes 

Y - 5.418 = .705(X - 4.293) 

= .705X - 3.027 
Y = 2.391 + .705X. 

This is the equation with the origin so shifted that the 
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original values may be employed directly. To determine 
the commercial bank rate normally associated with a Fed- 
eral Reserve rate of 6 per cent we may substitute the latter 
value in the equation just secured. 

Y = 2.391 + (.705 X 6.0) 

= 6.621. 

Precisely the same results are secured as with the equation 
in the other form, but for many purposes it is preferable 
to have an equation in which the actual values may be 
inserted. 

The equation 

Ox 

x = r — y 

Oy 

may be similarly changed to 

X — X — r— (7 — Y). 

Gy 


Summary op Correlation Procedure 

In the foregoing pages there have been presented two 
quite different methods of securing the values required in 
measuring the relationship between two variables. The 
steps in the two methods may be briefly summarized. The 
method of least squares is basic in both cases, but that term 
may appropriately be employed to describe the first method 
outlined, for the process of fitting the line is the first and 
fundamental step in that procedure. 

I. The Least Squares Method. 

A. Data to be handled as individual items. 

1. Fit a straight line to the data by the method of least 
squares. A simple arrangement of the data in columns 
will permit the ready computation of the required 
values, 2(X), 2 (Y), 2(X S ), 2 (K*), 2 (XY). The equa- 
tion thus obtained describes the average relationship 
between the two variables. 
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2. Compute the standard error of estimate, S„, from the 
formula 

O , _ 2(n - «2(F) - &2(XF) 

s » iv 

is a measure of the reliability of estimates based 
upon the equation of relationship, and is to be inter- 
preted in the same way as is the standard deviation 
about an arithmetic mean. 

3. Compute the coefficient of correlation, r, from the for- 
mula 



or from 

= gggo + b2(XY) - Ncl 
2(F 2 ) - Nc y 2 

Give r the sign of the constant b in the equation of regres- 
sion. This coefficient is an abstract measure of the degree 
of relationship between the two variables, in so far as this 
relationship may be described by a straight line. 

4. If an equation describing the regression of X on Y 
(X being dependent) is desired, the proper values may 
be substituted in the two normal equations 
2(X) = Na + 62(F) 

2 (IF) = a2(F) + 62(F 2 ). 

The equation secured will be of the type 
X = a + bY. 

The standard error of estimate, S x , may be computed by 
making the appropriate changes in the formula as given 
for S u . The value of r will be the same as in the pre- 
ceding case, in which Y is dependent. 

B. Data to be classified. 

1. Select an appropriate class-interval and tabulate the 
items in the form of a correlation table. 

2. Compute the necessary values for fitting a straight line 
to the data. In doing so, an arbitrary origin may be 
selected for each variable, and all values expressed in 
class-interval units. A re-arrangement in columnar form 
may facilitate the computation of the quantity 

S(X'F'). 
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II. 


3. Compute the standard error of estimate, employing the 
formula given above. 

4. Compute the coefficient of correlation from the formula 
given above. 

6. If the above calculations were carried on in class-interval 
units, the equation of average relationship and the stand- 
ard error of estimate should now be expressed in terms of 
the original units of measurement. If an arbitrary origin 
was employed, the equation should be corrected so that 
the variables relate to deviations from the true origin. 

The Product-Moment Method. 

A. Data to be handled as individual items. 

1. Arrange the paired observations in parallel columns and 
compute the quantities 2(X), 2(F), 2(X 2 ), 2(F 2 ), 
2 (XY). 

2. Divide these quantities throughout by N. For the first 
two of these quotients we may use the symbols c x and 


Cy (i.e., 

2(X) 


N 

and 

2(7) 


N 


— C x 


= Cy). 


3. Compute the mean product from the formula 

2(X7) 

V - N c *Cv 

4. Compute the two standard deviations from the formulas 

=1 /H2n, 


<Tz 


N 




2(7 2 ) 

N 


- Cy 2 ‘ 


5. Compute the coefficient of correlation from the formula 


r = 


V 

(T x (Ty 
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6 . 


7. 


8 . 


B. 

1 . 

2 . 

3. 

4. 

5. 


6 . 


7. 


Determine the equations of regression by substituting 
the proper values in the formulas 

Gy 

y = r — x 
(T x 

G x 

x = r — y- 

Gy 

(Note: For each of these equations the origin is at the 
point of averages.) 

If desired, transfer the origin to zero on the two original 
scales by substituting the arithmetic means in the 
equations 

Y - F= (X - X) 

(Tx 

X -X = r- (Y - Y). 

a v 

Compute the two standard errors of estimate from the 
formulas 

Sy = GyV 1 ~ r 2 

S x = g x y/ 1 — r 2 . 


Data to be classified. 


Construct a correlation table as in I. B. above. 

Select an assumed mean for each variable. Measure the 
deviations of the various items from the assumed means 
in class-interval units. 

Compute c x and c v in class-interval units. 

Compute g x and g v in class-interval units. 

Compute 2 {x'y') in class-interval units for each compart- 
ment of the correlation table. Total these figures to get 
H(x f y f ) for the whole table. 

Determine the value of the mean product in class-interval 
units from the formula 


V = 


N 


— CxCy* 


Compute r from the formula 



g x g v 
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8. Reduce <r* and <r„ to original units. 

9. Determine the equations of regression by substituting 
the proper values in the formulas 

<Jy 

y = r — x 

Oz 

and 

<r* 

x = r — y- 
<r y 

10. If desired, transfer the origin to zero on the two original 
scales from the formulas 

Y — Y - r—{X — X) 

(T x 

X — X = r — (F — Y)- 

<Ty 

11. Compute the two standard errors of estimate from the 
formulas 

S y = <r„Vl — r 2 
S z = <t z V 1 — r 2 . 

It is advisable, in all cases, to construct scatter diagrams 
and to plot the lines of regression thereon. It is generally 
possible to derive from such diagrams a truer idea of the 
relations involved, and of the adequacy of the methods 
employed, than may be obtained from a study of the figures 
alone. 


LIMITATIONS 

A question naturally arises as to the degree of generality 
attaching to the measures of relationship described in the 
preceding pages. Are they limited to certain types of dis- 
tributions, or may they be employed as absolutely general 
and universally valid measures? 

As we have seen, the standard deviation has a precise and 
definite meaning with respect to distributions following the 
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normal law. Having values of the mean and of the standard 
deviation, we know, in such cases, the exact percentage 
of observations which will fall within any stated limits. 
If the distribution departs from the normal type the standard 
deviation is still a useful measure, but it cannot be inter- 
preted in the same exact sense. Bearing this in mind, the 
formula 



may be considered. 

When the distribution of the original values of the 
dependent variable about their mean is normal and the 
distribution about the least squares line is normal, both 
S y and <x v have specific and exact meanings, and it is per- 
fectly legitimate to compute such a measure as r, based 
upon the relation of one to the other. Departures from 
normality in either case reduce the significance of this 
comparison. But we have seen that the standard deviation 
remains a useful measure even though the departure from 
the normal type be fairly pronounced, though in the latter 
case it lacks the precise significance attaching to it in a 
normal distribution. In the same way the standard error of 
estimate and the coefficient of correlation may be computed 
and utilized, even when all the requirements of normality are 
not met. Care must be taken in their interpretation in 
such cases, however. It must be clearly recognized that 
these measures have their full significance only in cases 
where the original distribution of the dependent variable 
and the distribution about the least squares line are both 
normal, or approximately so. 

A simple example may make clear the effect upon the 
value of the coefficient of correlation of an extreme departure 
from a normal distribution. In the following table are 
listed certain selected figures taken from the 1919 Census 
of Manufactures, for the State of New York. 
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Table 93 


Wage-Earners in Factories and Value of Products, 1919, in Eleven 
Cities in the State of New York 



Number of wage- 

Total value of 

City 

earners (in 

; products (in 

thousands) 

millions of dollars) 


(X) 

(Y) 

Batavia 

2.2 

9 

Beacon 

2.2 

10 

Coming 

3.5 

11 

Geneva 

2.5 

10 

Glens Falls 

2.8 

12 

Ithaca 

1.7 

10 

Middletown 

2.2 

10 

Peekskill 

2.1 

11 

Rensselaer 

1.4 

10 

Tonawanda 

1.8 

16 

New York City 

638.8 

5,261 


When the first ten of these cities, in the order listed, are 
treated as a group, the following values are secured: 

o v = 1.8682 
S v = 1.8669 
r = - .034. 

(No general significance is to be attached to the above 
coefficient of correlation, for the cities were selected for the 
purpose of illustrating a particular point.) 

The ten points and the line of regression are plotted in 
Fig. 74. 

When New York City is included in the group, the values 
secured for the sample of eleven cities are 

o v = 1509.3 
S v = 7.53 
r = + .999988. 

The eleven points and the line of regression are plotted in 
Fig. 75. 

The reason for the markedly different results is obvious. 
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When the one very large city is included with the ten small 
cities the standard deviations of both variables are greatly 
increased. That of the F-variable (value of products) is 
increased from 1.8682 to 1509.3. But S V) the measure of 
the scatter about the fitted line, undergoes no such pro- 
Millions 


of 



1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 
Thousands of Wage Earners 


Fig. 74. — Showing the Relation between Number of Wage-Earners in 
Factories and Value of Products in Ten Selected Cities in the State of 
New York 

nounced change in value. For the ten cities it is 1.8669; 
for the eleven cities 7 . 53. This is due to the fact that the 
one exceptional case is given such a great weight, in fitting 
by the method of least squares, that the fitted line must 
pass through or very near the point representing this 
observation. Accordingly, S is always affected less than 
a by a single very exceptional case. Since the value of r 
depends upon the relationship 



the presence of such a case always tends to increase the 
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value of the measure of correlation. The introduction of 
the one exceptional case in the above example changes a 
correlation coefficient of virtually zero to one of unity. 
The result, of course, is meaningless. 

While this example represents an extreme instance, the 
same distortion will be felt, to a greater or less degree, 
whenever there is a departure from a normal distribution. 



Thousands of Wage Earners 


Fig. 75. — Showing the Relation between Number of Wage-Earners in 
Factories and Value of Products in Eleven Selected Cities in the State of 
New York 

In practice the various measures of relationship cannot 
be restricted to perfectly normal distributions, but they 
must be interpreted with care when there is reason to believe 
that such disturbing influences are present. 

The Coefficient of Rank Correlation 

The coefficient of rank correlation is a measure of rela- 
tionship not subject to the distortion introduced by material 
departures from normality, and one which is particularly 
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useful in providing an objective test of the existence of 
correlation. Its application calls merely for the orderly 
ranking of observations. Thus we may rank 47 states of 
the union 1 according to the number of individual income 
tax returns in 1934, and according to the number of pass- 
enger automobiles registered in that year. The results are 
shown iD Table 94. 


Table 94 

Illustrating the Computation of the Coefficient of Rank Correlation 

(1) (2) (3) (4) (5) 

Rank on basis of Rank on basis of 


State 

number of indi- number of pas- 
vidual income senger autc mo- 

Difference 
(2) - (3) 



tax returns in biles registered 

d 

d 2 

Nevada 

1934 in 1934 

1 1 

0 

0 

Wyoming 

2 3 

- 1 

1 

New Mexico 

3 4 

- 1 

1 

S. Dakota 

4 15 

- 11 

121 

Idaho 

5 9 

- 4 

16 

N. Dakota 

6 12 

- 6 

36 

Vermont 

7 5 

+ 2 

4 

Delaware 

8 2 

+ 6 

36 

Arizona 

9 6 

+ 3 

9 

Utah 

10 7 

+ 3 

9 

Mississippi 

11 13 

- 2 

4 

Arkansas 

12 16 

- 4 

16 

1 Washington is excluded, because the published income tax returns for that 

state include those of Alaska. 

Following are the records for the nine states not listed in Table 86. 



No. of taxable personal 

No. of passenger auto- 


incomes ( in thou- 

mobiles registered (in 


sands) 1934 

thousands) 1934 

California 

316 

1,769 


Illinois 

310 

1,282 


Massachusetts 

243 

687 


Michigan 

139 

1,026 


New Jersey 

211 

741 


New York 

808 

1,971 


Ohio 

210 

1,453 


Pennsylvania 

342 

1,466 


Texas 

119 

1,086 
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Table 94 — Continued 


Illustrating the Computation of the Coefficient of Rank Correlation 


(1) 

(2) 

(3) 

(4) 

(5) 


Rank on basis of 

Rank on basis of 




number of indi- 

number of pas- 

Difference 


State 

vidual income 

senger automo- 

(2) - (3) 



tax returns in 

biles registered 

d 

cP 


1934 

in 1934 



S. Carolina 

13 

18 

- 5 

25 

New Hampshire 

14 

8 

+ 6 

36 

Montana 

15 

10 

+ 5 

25 

Maine 

16 

14 

+ 2 

4 

Alabama 

17 

19 

- 2 

4 

Nebraska 

18 

30 

- 12 

144 

Oregon 

19 

21 

- 2 

4 

W. Virginia 

20 

17 

+ 3 

9 

Colorado 

21 

22 

- 1 

1 

Rhode Island 

22 

11 

+ 11 

121 

N. Carolina 

23 

31 

- 8 

64 

Florida 

24 

23 

+ 1 

1 

Kentucky 

25 

25 

0 

0 

Kansas 

26 

33 

- 7 

49 

Louisiana 

27 

20 

+ 7 

49 

Tennessee 

28 

26 

+ 2 

4 

Georgia 

29 

29 

0 

0 

Oklahoma 

30 

32 

- 2 

4 

Virginia 

31 

28 

+ 3 

9 

Iowa 

32 

35 

- 3 

9 

Minnesota 

33 

36 

- 3 

9 

Indiana 

34 

38 

- 4 

16 

Maryland 

35 

24 

+ 11 

121 

Connecticut 

36 

27 

+ 9 

81 

Wisconsin 

37 

34 

+ 3 

9 

Missouri 

38 

37 

+ 1 

1 

Texas 

39 

42 

- 3 

9 

Michigan 

40 

41 

- 1 

1 

Ohio 

41 

44 

- 3 

9 

New Jersey 

42 

40 

+ 2 

4 

Massachusetts 

43 

39 

+ 4 

16 

Illinois 

44 

43 

+ 1 

1 

California 

45 

46 

- 1 

1 

Pennsylvania 

46 

45 

+ 1 

1 

New York 

47 

47 

0 

0 


1,094 
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The degree of correlation is indicated by the degree of 
concordance between the two rankings. A precise measure 
of correlation is provided by the coefficient 

, 62rf 2 

P T = 1 i 

n 3 — n 

where d is a difference between the rankings of a given 
state in columns (2) and (3), and n is the number of states 
included. 1 (The Greek letter rho (p) with subscript r is used 
as the symbol of this coefficient.) 

The method of computation is shown in Table 94. From 
the measurements there given we have 

_ . 6 X 1,094 _ 6,564 

Pr (47) 3 - 47 103,776 

= .94. 

1 This formula may be derived from the familiar product-moment formula 
for the coefficient of correlation, simplified because of the fact that the sums 
of the squares of the deviations of the first n natural numbers from their mean 

, , n 3 — n 
is equal to — — 

1a 

If we let d equal the difference between the rank of one variable and the 
corresponding rank of the other, we have, for any given pair of observations, 
d = X — Y = x — y (since the means of the two 
series of ranks are identical) 

2d 2 = 2(x - t/) 2 = 2s 2 -f St/ 2 - 22xt / 

22x7/ - 2x 2 + 2 y 2 - 2d 2 . 

n s — n n l — n 

But 2x 2 * — and 2 y 2 = — — — 

12 12 

Therefore 2Zxy — — 2d 2 

12 

Butp r = — ■ (the product-moment formula for r) 

V2* ! - 2y* 

sa . ■ ii — i ... ■ 

w 3 — n 
12 

w x _ 62d 2 ^ 

7? 3 — n 
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The coefficient of rank correlation is appropriate where 
it is possible to rank individuals, or other entities, on the 
basis of abilities or qualities not open to exact measurement. 
It is also well adapted for use where the distributions of 
the observations depart widely from the normal type, and 
where the usefulness of customary measurements would 
be seriously impaired. This point takes on particular im- 
portance in connection with tests of significance, involving 
generalizations from sample results . 1 Such tests are discussed 
in later chapters. 


REFERENCES 

Bowley, A. L., Elements of Statistics, Part II, Chaps. 6, 7. 

Brunt, David, The Combination of Observations, Chap. 10. 

Burgess, Robert, The Mathematics of Statistics, Chap. 10. 

Camp, B. H., The Mathematical Part of Elementary Statistics, 
Part I, Chaps. 8, 9. 

Chaddock, R. E., Principles and Methods of Statistics, Chap. 12. 

Croxton, F. E. and Cowden, D. J., Practical Business Statistics, 
Chap. 19. 

Crum, W. L. and Patton, A. C., An Introduction to the Methods 
of Economic Statistics, Chaps. 15, 16. 

Davies, G. R. and Yoder, Dale, Business Statistics, Chap. 6. 

Day, E. E., Statistical Analysis, Chaps. 12, 13. 

Elderton, W. P., Frequency Curves and Correlation, Chap. 7. 

Ezekiel, Mordecai, Methods of Correlation Analysis, Chaps. 1-5, 
7-9. 

Florence, P. S., The Statistical Method in Economics and Political 
Science, Chap. 9. 

Galton, Francis, “Correlations and their Measurement.” Pro- 
ceedings of the Royal Society, Vol. XLV, 1888 (136-145). 

Jones, D. C., A First Course in Statistics, Chap. 10. 

Kelley, Truman L., Statistical Method, Chap. 8. 

Moore, H. L., Forecasting the Field and the Price of Cotton, 
Chap. 2. 

Pearl, Raymond, Introduction to Medical Biometry and Statis- 
tics, Chap. 14. 

1 See Harold Hotelling and Margaret Pabst, “ Rank Correlation and Tests 
of Significance Involving No Assumption of Normality.” Annals of Mathe- 
matical Statistics, Vol. 7, 1936. 29-43. 



REFERENCES 


S79 


Pearson, Karl, The Grammar of Science, Chaps. 4, 5. 

Pearson, Karl, “Regression, Heredity and Panmixia.” Philo- 
sophical Transactions, Royal Society. Series A. Vol. CLXXXVIT, 
1896 (253-318). 

Richardson, C. H., An Introduction to Statistical Analysis, 
Chap. 7. 

Rietz, H. L. and Crathorne, A. R., “Simple Correlation” (in 
Handbook of Mathematical Statistics, Rietz, H. L. ed., Chap. 8.) 

Smith, J. G., Elementary Statistics, Chaps. 20, 21. 

Snedecor, G. W., Statistical Methods Applied to Experiments in 
Agriculture and Biology, Chaps. 6, 7. 

Tippett, L. H. C., The Methods of Statistics, Chap. 7. 

Walker, Helen M., Studies in the History of Statistical Method, 
Chap. 5. 

Waugh, A. E., Elements of Statistical Method, Chap. 9. 

Whittaker, E. T. and Robinson, G., The Calculus of Observations, 
Chap. 12. 

Yule, G. U. and Kendall, M. G., An Introduction to the Theory 
of Statistics, Chaps. 11, 12. 



CHAPTER XI 


THE MEASUREMENT OF RELATIONSHIP 
BETWEEN TIME SERIES 

The methods of measuring correlation described in the 
preceding chapter were devised originally for the analysis 
of non-historical data, that is, for the treatment of frequency 
series rather than time series. The measurement of corre- 
lation between series in time presents certain distinctive 
problems which require separate treatment. 

We have seen that such series are affected by various 
forces, which have been classified as the secular trend, 
cyclical and seasonal fluctuations and accidental variations, 
and methods have been described by means of which the 
effects of these various forces may be isolated. This breaking 
up of a series into its component parts for separate study 
is essential in attempting to correlate series in time, for 
spurious and quite misleading results will be secured if 
this is not done. The problem of correlation is that of 
securing a precise measure of the degree of relationship 
between variable quantities. But each series in time repre- 
sents the combination of a number of variables and, so 
far as possible, each should be treated separately in corre- 
lating such series. 

The relationship between two time series as, for example, 
interest rates and bond prices, may be studied with respect 
to any or all of the following components: 

а. Secular trend. 

б. Cyclical fluctuations. 

e. Seasonal fluctuations. 

d. Changes from one time unit to the next (e.g., week to week, 
month to month, or year to year). 

880 
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Such relationships may be studied, first, through the 
comparison of graphs, and much may be learned by this 
simple process. The similarity or dissimilarity of secular 
trends, and the general relation between cyclical movements 
may be determined by a study of such graphs. For more 
accurate comparison the coefficient of correlation may be 
used, but when it is so employed it is particularly impor- 
tant that the precise nature of its employment and the 
exact significance of the results be understood. 

For the comparison of secular trends the coefficient of 
correlation would never be employed. The mere fact that 
two series have the same secular trend is no indication 
of a relationship of interdependence; a coefficient of correla- 
tion based upon the trend values would be meaningless. 
Moreover, much simpler methods are available for comparing 
trends. 

For the same reason a coefficient of correlation should 
not be based upon the original absolute values of two 
series in time, except in the rather rare case in which neither 
series is marked by a definite secular trend. The computa- 
tion of r, when dealing with ordinary statistical data, 
involves measuring the deviations of all the items from 
their respective arithmetic means, and securing the sum 
of the products of the paired deviations. When deviations 
of like sign are paired throughout r will have a positive 
value; when deviations of unlike signs are paired throughout 
r will have a high negative value. The presence of pro- 
nounced rising or declining secular trends makes it impossi- 
ble to secure significant values for r by the employment of 
this method. For example, the relation between automobile 
production and the price of bacon between the years 1900 
and 1920 might be measured. The secular trend is markedly 
rising in each case. When the deviations of the annual 
figures are measured from the arithmetic means of the 
two scries, the paired items for the earlier years will be 
negative, for the later years positive. A fairly high positive 
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value for r would be secured, were the computation carried 
through on this basis. This value would be quite misleading, 
for no real relationship can be expected in this case. The 
coefficient of correlation in such a case would measure, 
primarily, the relation between the two secular trends. 

This coefficient might conceivably be employed to deter- 
mine the similarity between seasonal fluctuations in two 
series, but its utility for this purpose may be questioned. 
Here again other and simpler methods are available. 

In practice, therefore, the device of correlation should 
be employed neither to measure the relation between secular 
trends nor between seasonal movements. Its use is confined 
to comparisons of two or more series with respect to cyclical 
fluctuations and with respect to the short time changes 
from month to month or year to year. And, if valid measures 
of correlation are to be secured in making such comparisons, 
the effects of forces which distort these comparisons should 
be eliminated, in so far as this is possible. The actual work 
of correlation must be preceded by a sifting process designed 
to remove such irrelevant material. Unless the data are 
thus “distilled” the interpretation of the resulting coeffi- 
cients will be difficult. 

The Measurement of Correlation between Cyclical 

Fluctuations 

In an earlier chapter we have dealt with methods by 
which the effects of certain of the factors affecting time 
series might be measured and eliminated. The spurious 
correlation due to secular trend may be avoided by measur- 
ing the deviations of the observations not from the respec- 
tive arithmetic averages but from the lines of secular trend 
of the two series. These variations, the deviations from 
trend, are the significant values if our interest centers in the 
cycles. If annual values are employed the problem of elim- 
inating seasonal fluctuations is not faced. 

To illustrate this method of measuring the relationship 
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between series in time we may undertake to determine 
whether there is any connection between cyclical fluctuations 
in cotton production and in cotton prices. Figures for 
crop years are to be employed, for the period 1901-02 to 
1935-36. 

Cotton prices require some correction before correlation 
is attempted. The raw figures with which the investigation 



Fia. 76. — Cotton Production in the United States, Crop Years 1901-1902 
to 1935-1936, with Line of Trend 


starts are average spot prices at New York for middling 
upland cotton, at wholesale, from September to May of 
each crop year. But such prices reflect not only the effects 
of varying conditions in the cotton market, but also changes 
in the general level of prices. To eliminate the effect of 
this factor the original prices are deflated by Bradstreet’s 
price index, as computed for the September-May period in 
each crop year. For this purpose, Bradstreet’s index has 
been reduced to relative terms, with the average for the 



Table 95 


( 1 ) 


Crop 

year 


1901 - 02 

1902 - 03 

1903 - 04 

1904 - 05 

1905 - 06 

1906 - 07 

1907 - 08 

1908 - 09 

1909 - 10 

1910 - 11 

1911 - 12 

1912 - 13 

1913 - 14 

1914 - 15 

1915 - 16 

1916 - 17 

1917 - 18 

1918 - 19 

1919 - 20 

1920 - 21 

1921 - 22 

1922 - 23 

1923 - 24 

1924 - 25 

1925 - 26 

1926 - 27 

1927 - 28 

1928 - 29 

1929 - 30 

1930 - 31 

1931 - 32 

1932 - 33 

1933 - 34 

1934 - 35 

1935- 36 


Cotton Production and Cotton Prices , 1901-1936 
( 2 ) ( 3 ) ( 4 ) ( 5 ) 


Cotton produc- 
tion in United 
States , excluding 
linters (in thou- 
sands of bales) 


9,510 

10,631 

9,851 

13,438 

10,575 

13,274 

11,107 

13,242 

10,005 

11,609 

15,693 

13,703 

14,156 

16,135 

11,192 

11,450 

11,302 

12,041 

11,421 

13,440 

7,954 

9,762 

10,140 

13,628 

16,104 

17,977 

12,956 

14,478 

14,825 

13,932 

17,096 

13,002 

13,047 

9,636 

10,443 


Cotton prices. 
Average of spot 
prices in N. Y. 
for middling 
upland cotton , 
Sept . to May 
(in cents per 
pound) 

8.64 

9.50 

13.20 

8.69 

11.40 
10.97 

11.41 
9.81 

14.62 

14.80 

10.34 

12.35 
13.40 

8.63 
12.04 
18.29 
29.96 
30 06 

38.63 
16 90 
18.67 
26.26 
31.79 
24.34 
20 60 
14.26 
20.19 

20.02 

17.00 
10.47 

6.42 

6.75 

10.95 

12.42 
11.59 

881 


Bradstreet’s 
price index , 
average , Sept, 
to May 

( 1913-14 - 100 ) 


86.2 

90.0 

88.6 

89.3 

92.3 

98.8 

93.2 

91.3 

100.6 

97.8 

100.0 

104.8 

100.0 

105.2 

121.2 

151.0 

197.9 

203.1 

226.3 

152.9 

127.2 

149.7 

145.3 

150.4 

153.8 

141.0 

149.2 

145.1 

132.1 

107.6 

86.1 

76.2 

100.6 

106.7 

112.6 


Cotton prices , 
deflated 
(in cents per 
pound) 


10.02 

10.56 

14.90 

9.73 

12.35 

11.10 
12.24 
10.74 

14.53 

15.13 
10.34 
11.78 
13.40 

8.20 

9.93 

12.11 

15.14 

14.80 

17.07 
11.05 
14.68 

17.54 
21.88 
16.18 
13 39 

10.11 
13.53 

13.80 
12.87 

9.74 
7.46 
8.85 

10.89 

11.64 

10.29 
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crop year 1913-14 equal to 100. The original figures for 
the two series to be correlated, together with the corrected 
price figures, are given in Table 95. 

These data are plotted in Figs. 76 and 77. Lines of trend 
fitted to the two series are shown on the charts. 1 



Fig. 77. — Prices of Middling Upland Cotton in New York, Crop Years 
1901-1902 to 1935-1936, with Line of Trend. (Figures relate to average 
annual prices, during crop years, deflated by Bradstreet’s index of whole- 
sale prices) 

The deviation of each annual item from the secular trend 
of the given series is now to be measured, and the coefficient 
of correlation between these deviations is to be calculated. 
The computations appear in Table 96. 

This value of — .648 for the coefficient indicates a fair 
degree of negative correlation between deviations of cotton 
production in the United States from the line of trend and 

1 The equation to the line of trend of cotton production is 

Y = 13,009.14 + 87.96X - 4.640X 2 - .1491X 3 , with origin at 1918-19. 

The trend equation for deflated cotton prices is 

X =13.96 + .152X — .01425X 2 — .00083X 3 , with origin at 1918-19. 



Table 96 


Compulation of Coefficient of Correlation, Cotton Production and 

Cotton Prices 


(1) 

(2) 

(3) 

(4) 

(5) 


(6) 

Crop 

year 

1901-02 

Deviation of Deviation of 
cotton pro- deflated cot - 
duction from ton prices 
trend (in from trend 
1,000's of (in cents 

bales) per lb.) 

x y 

- 1,395 - 1.32 

X* 

1,946,025 

y 4 

1.7424 

+ 

xy 

1,841.40 

1902-03 

- 393 

- .72 

154,449 

.5184 

+ 

282.96 

1903-04 

- 1,298 

+ 3.63 

1,684,804 

13.1769 

— 

4,711.74 

1904-05 

+ 2,161 

- 1.59 

4,669,921 

2.5281 

— 

3,435.99 

1905-06 

- 834 

4- .95 

695,556 

.9025 

— 

792.30 

1906-07 

+ 1,731 

- .42 

2,996,361 

.1764 

— 

727.02 

1907-08 

- 572 

+ .57 

327,184 

.3249 

— 

326.04 

1908-09 

+ 1,428 

- 1.11 

2,039,184 

1.2321 

— 

1,585.08 

1909-10 

- 1,945 

+ 2.49 

3,783,025 

6.2001 

— 

4,843.05 

1910-11 

- 476 

+ 2.87 

226,576 

8.2369 

— 

1,366.12 

1911-12 

+ 3,476 

- 2.14 

12,082,576 

4.5796 

— 

7,438.64 

1912-13 

+ 1,357 

- .93 

1,841,449 

.8649 

— 

1,262.01 

1913-14 

+ 1,648 

4* • 45 

2,715,904 

.2025 

+ 

741.60 

1914-15 

+ 3,542 

- 4.98 

12,545,764 

24.8004 

— 

17,639.16 

1915-16 

- 1,516 

- 3.47 

2,298,256 

12.0409 

+ 

5,260.52 

1916-17 

- 1,366 

- 1.50 

1,865,956 

2.2500 

+ 

2,049.00 

1917-18 

- 1,615 

+ 1.35 

2,608,225 

1.8225 

— 

2,180.25 

813.12 

1918-19 

- 968 

+ .84 

937,024 

.7056 

— 

1919-20 

- 1,671 

+ 2.97 

2,792,241 

8.8209 

— 

4,962.87 

1920-21 

+ 275 

- 3.15 

75,625 

9.9225 

— 

866.25 

1921-22 

- 5,273 

+ .41 

27,804,529 

.1681 

— 

2,161.93 

1922-23 

- 3,515 

+ 3.25 

12,355,225 

10.5625 

— 

11,423.75 

1923-24 

- 3,174 

4 7.62 

10,074,276 

58.0644 

— 

24,185.88 

1924-25 

+ 290 

+ 2.00 

84,100 

4.0000 

+ 

580.00 

1925-26 

+ 2,758 

- .65 

7,606,564 

.4225 

— 

1,792.70 

1926-27 

+ 4,637 

- 3.73 

21,501,769 

13.9129 

— 

17,296.01 

1927-28 

- 360 

- .04 

129,600 

.0016 

+ 

14.40 

1928-29 

+ 1,202 

+ .58 

1,444,804 

.3364 

+ 

697.16 

1929-30 

+ 1,608 

+ .07 

2,585,664 

.0049 

+ 

112.56 

1930-31 

+ 793 

- 2.56 

628,849 

6.5536 

— 

2,030.08 

1931-32 

+ 4,055 

- 4.24 

16,443,025 

17.9776 

— 

17,193.20 

1932-33 

+ 80 

- 2.17 

6,400 

70,225 

4.7089 

— 

173.60 

1933-34 

+ 265 

+ .66 

.4356 

+ 

174.90 

1934-35 

- 2,982 

+ 2.30 

8,892,324 

5.2900 

— 

6,858.60 

1935-36 

- 1,988 

+ 1.94 

3,952,144 

3.7636 

— 

3,856.72 


171,865,603 227.2511 - 128,167.61 
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Table 96 — Continued 

Computation of Coefficient of Correlation, Cotton Production and 

Cotton Prices 

. /Zx* . /171, 865,603 _ 

C *=\-N = V 35 =2 ’ 216 ° 

. /Zy’ . /227.2511 „ 

° v= V N = V = 2 548 

r = _l?y_ _ — 128,167.61 _ 

Mr* <r„ 35 X 2,216 . 0 X 2 . 548 ' 


the corresponding deviations of cotton prices in New York, 
during the period covered. 

From the values already computed we may derive an 
equation for estimating the variation in cotton price associ- 
ated with a given variation in production. This regression 
equation, as we have seen, is of the type 


y = 



x. 


In the present case y and x refer to deviations from the 
parabolic lines of trend. Substituting the given values, we 
have 


y = - .648 


2.548 

2,216 


x 


y — — .00074a:. 

This equation means that, on the average, a unit devia- 
tion of cotton production (x) above the line of trend was 
accompanied by a deviation of . 00074 units in cotton prices 
(; y ) below the line of trend. The unit employed in the 
production figures was 1,000 bales, in the deflated price 
figures, one cent. In the interpretation of the equation 
it may be simpler to use an x-unit of one million bales, 
making the equation of regression 


y = — ,74x. 

Thus a cotton crop one million bales above trend was 
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accompanied by prices about three quarters of a cent per 
pound below trend (with reference always to deflated 
prices). This was the average relationship during the 
period 1901-1936. It did not hold in all cases, as is shown 
by the fact that r has a value of but — .648. If this, or 
a similar law, held perfectly, r would have a value of — 1. 

The value of S, which measures the scatter about the 
ling of regression, may be computed from the formula 

S y — cr y V 1 — r 2 . 

In the present case, S v has a value of 1.94 cents. The 
significance of this measure has been explained in an earlier 
section. 

(It should be emphasized that the use of the above 
equation for estimating future prices is dependent upon 
the validity of projecting the two lines of secular trend.) 

In the preceding analysis deviations were measured in 
absolute units, and the results could be interpreted only 
in terms of absolute units, bales of cotton and cents per 
pound. For certain purposes it might have been more 
convenient to correlate percentage deviations from the two 
lines of trend, in which case the standard deviations and 
the equation of regression would have been expressed in 
these terms. The procedure, in this respect, will depend 
in part upon the use to which the results are to be put. 
The nature of the data will also affect a decision on this 
point. The use of percentage rather than absolute deviations 
would be desirable in handling series in which the range 
of absolute deviations had changed materially during the 
period covered. 

It is obvious that in the above problem there is an 
arbitrary element which was not present in the correlation 
problems previously studied. The deviations are measured 
from lines of trend, not from the arithmetic means, and 
these lines of trend are arbitrarily selected. The use of 
different lines of trend might give quite different results. 
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In the above example the lines of trend were both power 
curves of the third degree. We might, perhaps with equal 
reason, assume that the underlying trends are best defined 
by other functions. Coefficients of regression and correlation 
would have different values if this were done. The presence 
of this arbitrary element in the correlation of deviations 
from lines of secular trend detracts somewhat from the 
confidence that may be placed in the results. The critical 
problem here lies not in the mechanical process of correla- 
tion, but in the choice of an appropriate line of trend for 
each series. If, by the tests of inspection and of corre- 
spondence with such external evidence as may be available, 
it appears that the curve selected accurately represents the 
trend in each of the series correlated, the coefficient may 
be accepted as significant. But, in the interpretation and 
use of the results, the presence of this element of personal 
judgment in the preliminary calculations must not be 
forgotten. This applies with particular force if the study 
aims to establish a functional relationship between cyclical 
fluctuations in the two series, and if an estimating (or 
regression) equation is to be based upon the results. 

The Coefficient of Correlation and the 
Measurement of Time Sequence 

In the correlation of cotton production and cotton prices 
the object was to measure as accurately as possible the 
effect of variations in cotton production upon cotton prices. 
An equation was secured which described this relation 
when deviations were measured from the particular lines 
of trend employed. Cotton prices were considered to be a 
function of cotton production, and the object of the study 
was to measure this functional relationship. We seek, 
in such cases, to determine the degree to which cycles 
in one series depend upon or reflect cycles in a related 
series, assuming some functional relationship between them. 
This is essentially the problem described in introducing the 
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subject of correlation, and generally constitutes the major 
problem in studying the relation between series of any type. 

But a second and somewhat different problem may be 
faced in certain studies of time series. Assuming that 
two such series are marked by definite cycles, it is of 
interest to determine whether the cycles coincide in time, 
or whether cycles in one series consistently precede or lag 
behind cycles in the other. The coefficient of correlation 
has been found very useful in determining the degree of 
“lead” or “lag” in such cases. This problem is that of 
determi n ing merely temporal relationship, as opposed to 
the functional relationship that is ordinarily to be measured. 

THE RELATION BETWEEN STOCK PRICE CYCLES AND CYCLES 
OF BUSINESS ACTIVITY 

To illustrate the solution of a problem of this latter type, 
we may undertake to determine the relation, in time, 
between cyclical movements in industrial stock prices and 
in general business activity, as measured by the composite 
index compiled by the American Telephone and Telegraph 
Company. The monthly values of this index for the period 
1899-1937 have been presented in an earlier section. Figures 
relating to stock prices from January, 1903, to June, 1914, 
are given in Table 97. 

Table 97 

Cycles in Industrial Slock Prices, 1903-1914 1 
(Figures relate to deviations from trend in units of the standard deviation) 

Month 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 
January - .2 - 2.0 - .1 +2.3 +16 -1.3 + .5 +1.0 - .1 - .4 - .2 - .7 

February - .1 -2.1 + .2 + 2 2 +14 -1.5 + .3 + .5 + 1 - .4 - .6 - .6 

March - .3-2.1 + .6 +1.9 + .6 -1.1 + .3 + .8 - .1 - .1 - .7 - .6 

April - .5 -2.0 + .7 +1.7 + .6 - .8 + .5 + .5 - .1 + .3 - .6 - .8 

May - .6 -2.1 + .2 +1.4 + .4 - .5 + .8 + .4 0 + .2 - .7 - .8 

June - .9 -2.1 + .3 +1.5 + .2 - .5 + .9 + .1 + .1 + .2 -1.1 -.7 

July -1.4 -1.8 + .7 +1.3 + .3 - .2 +1.1 - .4 + .1 +.2 - .9 

August - 1.7 - 1.6 + .8 + 1.7 - .3 + .3 + 1.4 - .3 - .3 + .3 - .7 

September - 1.9 - 1.3 + .7 + 1.7 — .5 + .1 +1.4 - .3 - .7 + .4 - .6 

October - 2.3 - .9 + .8 +1.7 - 1.3 + .2 + 1.4 0 - .7 + .3 - .8 

November - 2.4 - .3 +1.1 +1.7-1.9 + .5 +1.4 + .1 -.5 +.2 - .9 

December - 2.1 - .1 + 1.8 + 1.7 - 1.6 + .5+13- .2- 4 0 - .9 

1 These figures, the results of analyses by W. M. Persons, are from the 
Review of Economic Statistics , published by the Harvard Committee on Eco- 
nomic Research. They are based upon the average price of 12 industrial stocks. 
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The data of the two series are plotted in Fig. 78. 1 From 
a comparison of the two curves in this chart it is clear 
that there is some relation between the movements in 
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Fig. 78. — Comparison of Cyclical Fluctuations in Industrial Stock Prices 
and in General Business Activity, 1903-1914 


the two series, but such a comparison affords no basis 
for a definite conclusion. Our object is to determine whether 
the cycles in the two series are exactly synchronous and, 
if they are not, to measure the average time interval by 
which cycles in one series precede the cycles in another. 
The significance of such studies in the analysis of the business 
cycle is obvious. 

For the study of pre-war relations data for the period 
from January, 1903, to June, 1914, may be employed. 
A coefficient of correlation is first computed for concur- 
rent items. A value of + .55 is secured. Next, the data 
are correlated with industrial stock prices preceding general 

1 The American Telephone and Telegraph index here plotted is not identical 
with that given in Chapter IX. The latter is a revised series, differing in some 
respects from the original index for the pre-war period that has been used in 
the present calculations. 
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business by one month. That is, the January, 1903, figure 
for stock prices is multiplied by the February, 1903, index 
of general business; the February stock price is multiplied 
by the March business index, etc. This process is carried 
through for the entire period from January, 1903, to June, 
1914. Only 137 monthly values are used in this computa- 
tion, as compared with 138 in the preceding case, for the 
January, 1903, business index and the June, 1914, stock 
price figure do not enter into the calculations. Accordingly, 
the values c* and c y (the two corrections to be applied 
because the origin does not coincide with the two averages) 
and the two standard deviations will be slightly different. 
These corrections may be readily made. The coefficient 
of correlation secured from these computations has a value 
of + .65. The same operation is repeated with other 
pairings of the two variables. The results are summarized 
below. 


Table 98 

Coefficients of Correlation between Industrial Stock Prices and an 
Index of General Business Activity 
(Based upon data for the period 1903-1914) 

Coefficient of Correlation 


Stock prices concurrent with business index + . 55 

Stock prices preceding business index by 1 month + . 65 
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These figures are plotted in Fig. 79. 

The coefficients increase to a maximum value of + . 76 
which is secured with stock prices preceding general business 
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by 4, 5, and 6 months. The stability of the coefficients 
with the period of “lead” varying from 3 to 7 months 
indicates that there was no one specific interval, within 
the limits thus indicated, between the cyclical movements 
of these two series. From the results here given it would 
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Fig. 79. — Coefficients of Correlation between Index of Industrial Stock 
Prices and Index of Business Activity, 1903-1914, Showing the Results 
Secured with Different Pairings. (In all pairings except that of concurrent 
items the business activity index follows the stock price index) 

appear that five months was the average interval by which 
stock prices preceded the general business index, but this 
was not sharply marked off as a constant relationship. 

With this record of pre-war relations we may contrast the 
experience of recent years. The Index of Industrial Activity 
of the American Telephone and Telegraph Company, given 
in Chapter IX, defines the state of business. Of stock 
price index numbers, the measurements currently published 
in the Review of Economic Statistics 1 are in a form best 

1 This is not a homogeneous series for the entire period covered. For the 
years 1919-1924 the index is based on the average price of 20 industrial stocks 
(the Dow-Jones index), expressed as deviations from trend in units of the 
standard deviation. For the period 1925-1937 the official all-inclusive index 
of the New York Stock Exchange (index No. 2) has been used. This index, 

{Footnote 1 continued on page 895) 
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adapted to our present needs, although a change in coverage 
during the period detracts somewhat from their utility for 
comparative purposes. Monthly values of this index for 
the period 1919-1937 are recorded in Table 99. The two 
series are plotted in Fig. 80. 

Table 99 


Cycles in Stock Prices, 1919-1937 1 


Month 

1919 

1920 

1921 1922 1923 

1924 

1925 

1926 

1927 

1928 

Jan. 

4- .38 

4- 1.44 

- .69 - 

.68 - .12 

- .47 

4- .15 

4-1.11 

4-1.04 

4-2.51 

Feb. 

4- .38 

4- .78 

- .70 - 

.53 4- .05 

- 43 

4- .17 

4- .90 

4- 1.31 

4-2.31 

March 

+ .70 

-f 106 

- .72 - 

.38 4- 15 

- .58 

- .21 

4- .33 

4- 1.25 

4-2.96 

April 

4- .90 

4-1.10 

- .68 - 

.18 - .02 

- .81 

- .12 

4- .52 

4- 1.32 

4-3.37 

May 

4- 1.40 

4- .50 

- .67 - 

.11 - .31 

- .88 

4- .19 

4- .55 

-f 1.56 

4-3.38 

June 

4- 1 76 

4- .46 

- 1.15 - 

.14 - .46 

- .82 

4- .27 

4- 81 

4-1 39 

4-2.82 

July 

4-2.01 

4- .38 

- 1,20 - 

.08 - .70 

- .58 

4- .41 

4- 1.00 

-f 1.94 

4-2.89 

Aug. 

4- 1.60 

4- .04 

- 1.32 4- .05 - .65 

- .39 

4- .43 

4-1.11 

4-2.07 

4-3.37 

Sept. 

4- 1.78 

-f .11 

- 1.16 4- 

.10 - .70 

- 44 

4- .53 

4- 1.13 

4-2.42 

4-3.63 

Oct. 

4-2.14 

- .04 

- 1.12 4- 

.10 - .84 

- .50 

4- 1.01 

4- .89 

4-2.06 

4-3.69 

Nov. 

4- 1.90 

- .44 

- .89 - 

.16 - .72 

- .25 

4- .90 

4- 107 

4-2.55 

4-4.39 

Dec. 

4- 1.54 

- .85 

- .70 - 

10 - .60 

- .02 

4-1.05 

4-1.05 

4-2.67 

4-4.41 

Month 

1929 

1930 

1931 

1932 1933 

1934 

1935 

1936 

1937 

Jan. 

4-4.21 

4-1.11 

- 1.35 

- 4 02 - 

4.31 

- 2.83 

- 3 30 

- 1.61 

- .64 

Feb. 

4-4.11 

4- 1.28 

- .83 

-3.90 - 

4.65 

- 2 90 

- 3.37 

- 1.51 

- .61 

March 

4-3.70 

4- 1.83 

- 1 22 

-4.20 - 

4.62 

-2.89 

- 3.51 

- 1.51 

- .65 

April 

4-3 84 

4- 1.59 

- 1.73 

- 4 64 - 

3 91 

- 2.93 

- 3.23 

- 1.92 

- 1.11 

May 

4-3 17 

f 1.41 

- 2 35 

-5.05 - 

3.33 

-3.19 

- 3.14 

- 1.71 

- 1.17 

June 

4-3.88 

4- .13 

- 1 85 

- 5.09 - 

2 90 

- 3.13 

- 2.97 

- 1.61 

- 1.44 

July 

4-4.25 

4- .25 

- 2 15 

-4.60 - 

3.26 

-3.51 

- 2.71 

- 1.31 

- 1.03 

Aug. 

4-4.89 

4- .24 

- 2.18 

- 3.86 - 

2.89 

- 3.37 

- 2.62 

- 1.27 

- 1.31 

Sept. 

4-4.10 

- .51 

- 3.42 

-3.97 - 

3 31 

- 3.40 

- 2 55 

- 1.23 

-2.04 

Oct. 

4- 1.67 

- 1 04 

- 3.23 

-4.30 - 

3.57 

- 3.45 

- 2.29 

- .90 

- 2.48 

Nov. 

4- .68 

- 1.21 

- 3.54 

-4.42 - 

3.32 

- 3.21 

- 2.10 

- .78 

- 2.85 

Dec. 

4- .74 

- 1.65 

- 3.99 

- 4.37 - 

3.27 

- 3.21 

- 1.93 

- .81 

- 3.03 


Results obtained from a study of the temporal relations 
between these two series, for the period 1919-1937, are 
given in Table 100. 

(Footnote 1 continued from page 898) 

originally constructed with the figure for Jan. 1, 1925 as 100, has here been 
expressed in terms of deviations from 100, in units of a standard deviation 
assumed to be equal to 15 on the original scale. In effect, a horizontal trend 
at the level of Jan. 1, 1925, has been assumed for the Stock Exchange index. 
This index has also been shifted slightly in time. The index figure relating 
to the first day of a given month, in the Stock Exchange tabulations, has here 
been recorded as for the month preceding. Thus a February 1st index is en- 
tered for January, a March 1st index for February, etc. 

1 From the Review of Economic Statistics. The figures in the table define 
deviations from trend, in units of the standard deviation, with the assumptions 
stated in the preceding footnote. The coefficients in Table 100 are based upon 
data through July, 1937, only. 
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Table 100 

Coefficients of Correlation between Stock Prices and an Index of 

Business Activity 

(Based upon data for the period 1919-1937) 

Coefficient of Correlation 

Stock prices concurrent with business index + . 85 

Stock prices preceding business index by 1 month + . 86 
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These measurements arc shown graphically in Fig. 81. 

In using these coefficients we should note that the stock 
price records for part of the recent period are different 
in important respects from those employed for the pre-war 
period. In place of the 12 industrial stocks entering into 
the earlier comparisons the index for the recent period 
included 20 stocks and, later, a comprehensive list composed 
of all varieties of stocks. The market behavior of the 
broader list may have departed somewhat from the pattern 
set by the limited number of industrial stocks. The differ- 
ence between the results for the two periods is to be inter- 
preted with this fact in mind. 

In post-war years the highest degree of correlation pre- 
vailed with the business index following the stock price 
index by one month. The traditional “lead” of stock 
prices, on the basis of which the movements of these prices 
have been used as forecasters of business changes, was 
clearly reduced in this period. The actual statistical record 
we have obtained may have been affected somewhat by 
the broadening of the coverage of the stock price index 
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used, but the change in the relations between the two 
series appears to have been a real one. 

This method of measuring temporal relations between 
economic series is highly useful, but one important caution 
should be noted. The method indicates the average degree 
of lead or lag of one series, with reference to another. 
Frequently the sequences of change in economic series are 
not the same in all phases of business cycles. Thus, observa- 
tions relating to ten business cycles occurring between 1890 
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Fig. 81. — Coefficients of Correlation between Index of Industrial Stock 
Prices and Index of Business Activity, 1919-1937, Showing the Results 
Secured with Different Pairings. (In all pairings except that of concurrent 
items the business activity index follows the stock price index'! 

and 1925 indicate that pig iron prices preceded the general 
index of wholesale prices by 3.4 months, on the average, 
in business recessions, but followed the general index by 
5.1 months, on the average, in periods of business revival. 1 
This highly important difference would be ironed out in 
the measurement of average temporal relations by the 

1 Cf. The Behavior of Prices , New York, National Bureau of Economic 
Research, 1927, 84-87. 
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correlation method. The use of an average should be 
supplemented by a study of the items entering into the 
average. Study of the relations among individual observa- 
tions at different cyclical phases is essential when correlation 
technique is employed to define time sequences among the 
movements of economic series. 

The Use op the Moving Average in Correlating 
Cycles in Time Series 

The preceding discussion has dealt only with cycles as 
measured from mathematically fitted lines of trend. But 
trend may be measured, as we have seen, by lines based 
upon moving averages, and the cyclical deviations from 
such lines may be correlated in precisely the same way as 
deviations from other lines of trend. The arithmetic 
mean of the deviations from such moving averages will 
not necessarily be zero, as in the case of deviations measured 
from lines fitted by the method of least squares, and a corre- 
sponding correction must be made in correlating such figures. 

Moving averages are subject to the same criticism as 
are mathematical lines of trend. There can be no certainty 
that deviations from lines of trend based upon moving 
averages represent the effects of cyclical causes solely. The 
result in a given case depends upon the period of the moving 
average employed, and there is no perfect criterion by 
which to determine the best measure of trend. Significant 
and useful coefficients may be computed when deviations 
are measured from moving averages, but the presence of 
an arbitrary element in the work must be recognized and 
the results applied with corresponding reservations. 

The Correlation op Short Term Fluctuations 

In describing the variable factors that constitute compo- 
nent elements of the values of a series in time, it was pointed 
out that the coefficient of correlation would not generally 
be employed in comparing either the secular trends or the 
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seasonal fluctuations of two series. It may be used to 
advantage in measuring either functional or temporal rela- 
tions between cyclical fluctuations, provided that the effects 
of the other variables have been, so far as possible, elimi- 
nated. The coefficient of correlation and the measures 
which are employed in conjunction with it have a further 
use in dealing with time series. They may be used to meas- 
ure the relation between short term changes in two series, 
changes from year to year, month to month, or even from 
week to week or day to day, if desired. This problem is 
distinct from that studied in the preceding section and in 
the interpretation of the results the two should not be 
confused. 

There are several ways in which the problem of comparing 
short term fluctuations may be attacked. The absolute 
differences between successive items in two series may be 
correlated, or these differences may be expressed as per- 
centages or ratios. Table 101 illustrates the procedure 
employed in measuring the correlation between the absolute 
fluctuations from year to year (first differences) of cotton 
production and cotton prices. The original values from 
which the items in columns (2) and (3) are derived are 
given in Table 95. 

The process of computing r is identical with that em- 
ployed in preceding examples, when deviations were meas- 
ured from an arbitrary origin. The arbitrary origin in 
this case is zero, but corrections must be made in the 
various values since the algebraic sum of the given figures 
is not zero in either case. Computations based on the 
figures in Table 101 follow: 


c* = — = + 933 = + .02744 
* N 34 

c x 2 = .000753 


<r x 



V' 


229.624987 


.000753 = 2.599 


34 
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Table 101 


Computation of Coefficient of Correlation between 
and Cotton Prices , 1902-1936 
(Based upon first differences) 


( 1 ) 

Crop 

Year 


( 2 ) 

Difference be- 
tween produc- 
tion in given 
year and pro- 
duction in pre- 
ceding year (in 
millions of 
bales) 

X 


( 3 ) 

Difference be- 
tween price in 
given year and 
price in pre- 
ceding year (in 
cents per pound , 
deflated) 

Y 


( 4 ) 


Cotton Production 

( 5 ) ( 6 ) 

F 2 XY 


1902-03 

+ 

1 . 

121 

1903-04 

— 


780 

1904-05 

+ 

3 

,587 

1905-06 

— 

2 

,863 

1906-07 

+ 

2 

.699 

1907-08 

— 

2 

.167 

1908-09 

+ 

2 

.135 

1909-10 


3 

237 

1910-11 

+ 

1 

.604 

1911-12 

+ 

4 

.084 

1912-13 

— 

1 

.990 

1913-14 

+ 


.453 

1914-15 

+ 

1 

.979 

1915-16 

— 

4 

.943 

1916-17 

+ 


.258 

1917-18 

— 


.148 

1918-19 

+ 


.739 

1919-20 

— 


.620 

1920-21 

+ 

2 

019 

1921-22 

— 

5 

,486 

1922-23 

+ 

1 

808 

1923-24 

+ 


,378 

1924-25 

+ 

3 

.488 

1925-26 

+ 

2 

476 

1926-27 

+ 

1 

,873 

1927-28 

— 

5 

,021 

1928-29 

+ 

1 

.522 

1929-30 

+ 


.347 

1930-31 

— 


.893 

1931-32 

+ 

3 

.164 

1932-33 

— 

4 

.094 

1933-34 

+ 


.045 

1934-35 

— 

3 

.411 

1935-36 

+ 


.807 


+ 


.933 


+ .54 1.256641 

+ 4.34 . 608400 

- 5.17 12.866569 

+ 2.62 8.196769 

- 1.25 7.284601 

+ 1.14 4.695889 

- 1.50 4.558225 

+ 3.79 10.478169 

+ .60 2.572816 

- 4.79 16.679056 

+ 1.44 3.960100 

+ 1.62 . 205209 

- 5.20 3.916441 

+ 1.73 24.433249 

+ 2.18 .066564 

+ 3.03 . 021904 

- .34 .546121 

+ 2.27 .384400 

- 6.02 4.076361 

+ 3.63 30.096196 

+ 2.86 3.268864 

+ 4.34 .142884 

- 5.70 12.166144 

- 2.79 6.130576 

- 3.28 3.508129 

+ 3.42 25.210441 

+ .27 2.316484 

- .93 .120409 

- 3.13 . 797449 

- 2.28 10.010896 

+ 1.39 16.760836 

+ 2.04 . 002025 

+ .75 11.634921 

- 1.35 .651249 

+ , 27 ~ 229.624987 


.2916 

+ 

.60534 

18.8356 

— 

3.38520 

26.7289 

— 

18.54479 

6.8644 

— 

7.50106 

1.5625 

— 

3.37375 

1.2996 

— 

2.47038 

2.2500 

— 

3.20250 

14.3641 

— 

12.26823 

.3600 

+ 

.96240 

22.9441 

— 

19.56236 

2.0736 

— 

2.86560 

2.6244 

+ 

.73386 

27.0400 

-- 

10.29080 

2.9929 

— 

8.55139 

4.7524 

■+ 

.56244 

9.1809 

— 

.44844 

.1156 

— 

.25126 

5.1529 

— 

1.40740 

36.2404 

— 

12.15438 

13.1769 

— 

19.91418 

8.1796 

+ 

5.17088 

18.8356 

+ 

1.64052 

32.4900 

— 

19.88160 

7.7841 

— 

6.90804 

10.7584 

— 

6.14344 

11.6964 

— 

17.17182 

.0729 

+ 

.41094 

.8649 

— 

.32271 

9.7969 

+ 

2.79509 

5.1984 

— 

7.21392 

1.9321 

— 

5.69066 

4.1616 

+ 

.09180 

.5625 

— 

2.55825 

1.8225 

- 

1.08945 

313.0067 


180.19834 
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<V = l£ - ± 3i-“ + 00794 
Cy * = .000063 

*. - j/f^* = -j/M: - .000063 - 3.034 

p ■ gga - ^ - (.02744 X .00794) 

iV o4 

p = - 5.300168 

_ p _ - 5.300168 

r ( t x <x v 2.599 X 3.034 

r = - .672. 

The equation of regression and the value of S u , computed 
from the usual formulas, are 

y = — .78:e 
S y = 2.25 cents. 

A comparison of the different results secured in the 
preceding examples relating to cotton throws some inter- 
esting light upon the general problem of correlation. In 
fact, in the two examples, we have measured the correlation 
between measurements that are not strictly comparable — 
deviations from third degree parabolas, in the first case, 
and year-to-year fluctuations in the production and price 
of cotton, in the second. Yet, if we were seeking to estimate 
the price of cotton which would accompany a given crop, 
an estimate might be based upon either of the studies, 
the results of which are given below. 

r S„ 

Correlation of cycles in cotton production and 
prices (deviations measured from third degree 
parabolas) — • 648 1 . 94 cents 

Correlation of year-to-year fluctuations, same data — .672 2.25 cents 

The value of r in the second example is slightly greater 
than the value secured in the first case, though the standard 
error is also larger. The reason for this apparent contradic- 
tion has been suggested above; the standard deviation of 
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(he year-to-year fluctuations in cotton prices is greater 
than the standard deviation about the trend of cotton prices. 

It appears that errors of estimate are less when based 
upon the results secured when deviations from third degree 
curves are correlated than when based upon the study of 
year-to-year movements. But there is a concealed assump- 
tion in the first case, the assumption that the lines of trend 
of both prices and production may be projected beyond the 
period studied. There is an immeasurable margin of error 
in this assumption, and the standard error of estimate, 
accordingly, does not give a true measure of the probabilities 
involved. No such assumption is involved in the measure 
based upon year-to-year fluctuations. 
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CHAPTER XII 


THE MEASUREMENT OF RELATIONSHIP: 
NON-LINEAR CORRELATION 

In the preceding chapters the discussion has been confined 
to cases in which the relationship between two variables 
may be described by a straight line. The coefficient of 
correlation, r, is a measure of the degree to which two 
variables approach a Unear relationship and it is signifi- 
cant only when a straight line gives a good fit to the points 
representing the paired values of X and Y. 

In fitting curves to time series, as explained in an earlier 
section, it is found that in many cases the trend is non- 
linear, and that a curve of higher degree is needed. The 
same thing is true in the field of our present discussion. 
It is possible to have a high degree of correlation between 
two variables when a straight line does not describe the 
relationship. In such a case there would be considerable 
scatter about the straight line of best fit, and the value 
of r would be misleadingly low. If a curve representing 
the real relationship could be fitted, the scatter would 
be materially reduced and the true correlation could be 
measured. The figures presented in Table 102 illustrate 
such a case. These data are plotted in Fig. 82. 

Two different curves have been fitted to the points 
plotted in this figure. One is a straight line having the 
equation 

Y = 5.038 + .0886 A 

in which Y represents yield, in tons per acre, and X repre- 
sents depth of irrigation water applied, in inches. The 
degree of relationship between the two variables, as de- 
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scribed by this line, is indicated by the coefficient oi 
correlation, r, which has a value of + . 69. 

Table 102 

Alfalfa Yield and Irrigation 
Summary of investigations at Davis, California 1 
(The measurements in the body of the table measure yields, in tons pei acre, 

in 44 experiments) 

Inches of irrigation water applied 



0 

12 

18 

24 

30 

36 

48 

60 


2.35 

4.31 

5.69 

6.00 

7.53 

7.58 

8.05 

5.55 


2.75 

4.78 

6.46 

6.89 

7.97 

8.22 

8.45 

7.25 


2.89 

4.84 

7.02 

7.96 

8.32 

8.63 

8.63 

10.17 


3.85 

5.83 

8.02 

8.32 

9.43 

9.33 

8.83 

10.70 


5.52 

6.51 


8.38 

9.54 

9.38 

9.52 


Average 

5.94 

7.52 


9.96 

11.06 

12.48 

10.62 


yield 

3.88 

5.63 

6.80 

7.92 

8.98 

9.27 

9.02 

8.42 7.48 


An inspection of the figure shows clearly that the straight 
line does not give the best possible fit. It is certain, there- 
fore, that r does not furnish a valid measure of the degree 
of relationship between alfalfa yield and depth of irrigation 
water. 


Parabolic Relationship 

The other curve in Fig. 82 is a second degree parabola, 
fitted by the method of least squares. The equation to this 
curve is 

Y = 3.539 + . 2527X - .002827X 2 . 

It is obvious that the effect of increasing irrigation upon 
alfalfa yield is described much more accurately by this 
latter curve than by the straight line. The most important 
result of these investigations was the determination of the 
point at which alfalfa yield began to fall off with increased 
applications of water, and the straight line fails to indicate 
any such decline. 

1 This table is taken from “The Economical Irrigation of Alfalfa in the Sac- 
ramento Valley ” by S. H. Beckett and R. D. Robertson, Bull. No. 280 l 
Agricultural Experiment Station, Univ. of California, May, 1917. 
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As the equation of relationship, therefore, we should use 
the parabolic rather than the linear form. The standard 
error, S y , which is a necessary accompanying measure, may 
be calculated by measuring the deviation of each value 
from the corresponding computed value, and determining 



Fiq. 82. — Scatter Diagram Showing the Relation between Alfalfa Yield 
and Irrigation Water Applied, with Two Lines of Regression 


the root-mean-square of these deviations. This procedure 
is illustrated in Table 103. The figures for normal yield 
which are given in this table are computed from the parabolic 
equation given above. 

Inserting the sum of the squared deviations, as given in 
col. (5) of Table 103, in the formula 


S y = 



2d* 

N 


we have 


S t 




80.9945 


44 


= 1.36. 
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Table 103 

Comparison of Actual and Computed Alfalfa Yield 


( 1 ) 

( 2 ) 

( 3 ) 

( 4 ) 

( 5 ) 

Tionih /it 


Normal yield , 

Deviation of 


uvpin UJ 

irrigation 

water 

Actual yield 

as computed 
from parabolic 

actual from 
normal 



equation 

( 2 ) - ( 3 ) 


X 

Y 

Y c 

d 

cP 

0 

3.85 

3.54 

+ .31 

.0961 

0 

5.94 

3.54 

+ 2.40 

5.7600 

0 

5.52 

3.54 

+ 1.98 

3.9204 

0 

2.75 

3.54 

- .79 

.6241 

0 

2.89 

3.54 

- .65 

.4225 

0 

2.35 

3.54 

- 1.19 

1.4161 

12 

4.78 

6.16 

- 1.38 

1.9044 

12 

7.52 

6.16 

+ 1.36 

1.8496 

12 

6.51 

6.16 

+ .35 

.1225 

12 

4.31 

6.16 

- 1.85 

3.4225 

12 

5.83 

6.16 

- .33 

.1089 

12 

4.84 

6.16 

- 1.32 

1.7424 

18 

7.02 

7.17 

- .15 

.0225 

18 

5.69 

7.17 

- 1.48 

2.1904 

18 

8.02 

7.17 

+ .85 

.7225 

18 

6.46 

7.17 

- .71 

.5041 

24 

6.00 

7.98 

- 1.98 

3.9204 

24 

8.38 

7.98 

+ .40 

.1600 

24 

8.32 

7.98 

+ .34 

.1156 

24 

6.89 

7.98 

- 1.09 

1.1881 

24 

9.96 

7.98 

+ 1.98 

3.9204 

24 

7.96 

7.98 

- .02 

.0004 

30 

7.53 

8.58 

- 1.05 

1.1025 

30 

9.54 

8.58 

+ .96 

.9216 

30 

9.43 

8.58 

+ .85 

.7225 

30 

7.97 

8.58 

- .61 

.3721 

30 

11.06 

8.58 

+ 2.48 

6.1504 

30 

8.32 

8.58 

- .26 

.0676 

36 

7.58 

8.97 

- 1.39 

1.9321 

36 

9.33 

8.97 

+ .36 

.1296 

36 

9.38 

8.97 

+ .41 

.1681 

36 

8.22 

8.97 

- .75 

.5625 

36 

12.48 

8.97 

+ 3.51 

12.3201 

36 

8.63 

8.97 

- .34 

.1156 

48 

8.45 

9.16 

- .71 

.5041 

48 

9.52 

9.16 

+ .36 

.1296 


(Continued on next page) 
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Table 103 ( Continued ) 

Comparison of Actual and Computed Alfalfa Yield 


(1) 

(2) 

(3) 

(4) 

(5) 

Depth of 
irrigation 
water 

Actual yield 

Normal yield 
as computed 
from parabolic 
equation 

Deviation of 
actual from 
normal 
(2) - (3) 


X 

Y 

Y c 

d 

d 2 

48 

8.63 

9.16 

- .53 

.2809 

48 

8.83 

9.16 

- .33 

.1089 

48 

10.62 

9.16 

+ 1.46 

2.1316 

48 

8.05 

9.16 

- 1.11 

1.2321 

60 

10.17 

8.52 

+ 1.65 

2.7225 

60 

7.25 

8.52 

- 1.27 

1.6129 

60 

10.70 

8.52 

+ 2.18 

4.7524 

60 

5.55 

8.52 

- 2.97 

8.8209 


80.9945 


THE INDEX OF CORRELATION 

We need now the third value, the abstract measure of 
degree of relationship. In dealing with cases of linear 
relationship in the preceding chapter we found that such 
a measure, the coefficient of correlation, could be derived 
from known values of S v and a v . An analogous measure 
may be derived in the same way in cases of non-linear 
relationship, such as that found in the present problem. 
Since the term coefficient of correlation and the symbol r 
refer only to cases of linear regression, we may term this 
general measure the index of correlation, and use the symbol 
p (rho) to represent it. 

As a general formula for the index of correlation we 
have 1 

1 With X dependent this formula becomes 



The first of the two subscripts refers always to the dependent variable, the 
second to the independent. It is essential that these be shown, for p would 
not necessarily be the same with X dependent as with Y dependent. Such a 
distinction is not necessary in the case of linear correlation, for r is the same 
no matter which variable be dependent. 
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The value of S v has been derived above. The value of 
<r„, computed by familiar methods, is found to be 2.27. 
Substituting in the formula for p, we have 

Pyx = 

= . 80 . 

This value is materially greater than that of the coefficient 
of correlation for the same data. The value of r is + . 69. 
The difference is due to the fact that the second degree 
parabola constitutes a much better fit to the data than 
the straight line. The correlation is distinctly non-linear, 
and r is an inappropriate measure of correlation. 

THE MEANING OF THE INDEX OF CORRELATION 

It is important that the significance and the limitations 
of p be understood. Its value depends upon the relation 
between the scatter about the fitted line and the scatter 
about the arithmetic mean of the F’s. In the case of a 
straight line, p and r are identical, r being a special case 
of p. The limits of p are 0 and 1, a value of 0 indicating 
that there is no relationship, or that if there is a relation- 
ship between the two variables it cannot be described by the 
particular equation employed. A value of 1 indicates that 
the relationship, as described by the equation employed, 
is a perfect one. For curves of higher degree no positive or 
negative sign should be attached to p, for the relationship 
might be positive over part of the range and negative over 
other parts, as in the alfalfa example given above. 

The index of correlation, p, has no significance unless the 
type of curve to which it applies be named in each case. 
The meaning of r in this respect is always clear, for it is 
understood that it relates always to a straight line, but 
confusion would arise in the case of p unless the type of 
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curve were specifically mentioned. The index of correlation 
may be looked upon as a measure of the adequacy of a 
curve of given type to describe the relationship between two 
variables. 

It is, of course, always possible to secure a curve which 
will pass through any number of points if the constants 
in the equation be equal to the number of points. In such 
a case p would, of necessity, be equal to 1, but this value 
would have no significance. In any employment of mathe- 
matical functions there is this limit of absurdity, when the 
number of constants is equal to the number of points, and 
p would merely reflect this absurdity. The ordinary prin- 
ciples of curve fitting must be kept in mind in using such 
an index as this. It must never be taken to have an absolute 
significance, standing by itself. Its significance is always 
relative, referring to the particular function employed. 
This fact, which is true of every measure of correlation, 
is frequently overlooked, and invalid and fallacious con- 
clusions reached as a result. 

A SHORT METHOD OF COMPUTING THE INDEX OF CORRELATION 

The standard error and the index of correlation were 
computed by a rather laborious method in the above ex- 
ample, in order that there might be no misunderstanding 
of their precise meaning. The burden of calculation may 
be materially reduced, however, by taking advantage of 
the relationships which were disclosed in dealing with r. 
For a curve of the potential series 

Y = a + bX + cX 2 + dX* . . . 

the formula for S y is derived by a simple extension of that 
employed in the case of the straight line. As a general 
formula for a series of this type, we have 

„ , 2(n - flS(F) - &2(X7) - c2(X’F) - d2(X*Y) - . . . 

Sy ~ N 

Similarly, the formula for r may be extended to give a 
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general formula for p applicable to any equation of this 
general type. This formula 1 is 

, _ «2(F) + feS(AF) + c2(X 2 F) + d2(X*Y) + . . . - NcJ 
Pvx 2(F 2 ) - Nc v 2 

In the special case in which the origin is at the mean of 
the F’s, 2(y) =0 and c„ = 0, and the formula reduces to 

. # _ b?(Xy) + c2(X*y) + dZ(X*y) + . . . 

Pyx __ 

The characteristics of the formulas for S and p should 
be noted. The only values required in securing these 
measures are the constants in the equation which describes 
the average relationship, certain values which have been 
used in the process of fitting and, in addition, 2(F 2 ) and 
c u 2 . Thus, as direct by-products of the fitting process, we 
have the values of S and p, the two measures which are 
needed to supplement the regression equation in securing 
a complete description of the relationship between the two 
variables in question. The equation describes the average 
relationship. The standard error, S, is a measure of the 
reliability of estimates based upon this equation, and p is 
an abstract index of the degree of relationship, in so far as 
that relationship can be described by the particular curve 
employed. 

The application of these formulas may be illustrated 
with reference to the problem of alfalfa yield. The following 
values, derived from the data of Table 102 and from the 
fitting process, are required for this purpose: 

a = 3.539 2 (Z 2 F) = 407,564.64 

b = .252652 c y 2 = 55.9197 

c = - .002827 2(F 2 ) = 2,688.2268 

2(F) = 329.03 A = 44. 

2(XF) = 10,271.72 

Substituting in the formula for the standard error for a 

1 Sec Appendix A for a discussion of the derivation of this formula. 
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second degree parabola, 

„ , 2(F 2 ) - a2(F) - V2(XY) - c2(X 2 F) 

S v ’ 


we have 

„ 2,688.2268 - (3.539 X 329.03) - (.252652 X 10,271.72) - (-.002827 X 407,564.64) 

3 ‘ ” “44 

80.8043 
~ 44 

= 1.8365 
S y = 1.36. 

The index of correlation, for a curve of this type, is 
computed from the equation 

2 _ a2(F) + bl(XY) + c2(X 2 F) - Ac„ 2 
pyx 2(F) 2 - Nc v 2 

Substituting the appropriate values, we have 

2 _ 146.9557 

Pyx 2,688 . 2268 - (44 X 55 . 9197) 

= .6452 

Pyx = .80. 

The value of the index of correlation is influenced by 
the relation between the number of observations and the 
number of constants in the equation of relationship. When 
the two are equal p will have a value of 1. In any case the 
observed index of correlation tends to exceed the true index. 
When the number of observations is not large it is advisable 
to apply a correction for this bias. If we use p to represent 
the corrected value and m to represent the number of 
constants in the equation of relationship, we may apply 
a correction in terms of the relation 1 

pyx = 1 ~ { (1 “ Pvx ^{~N~^~m) } ' 

Inserting the values given in the above example, we have 

1 From Mordccai Ezekiel, Methods of Correlation Analysis, New York, 
Wiley, 1930, 121. 
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Pv* 2 = 1- {(1 - -6452)(^— ])} 

= .6279 
Pyx = .79. 

If, in the application of this test, the value in brackets { } 
exceeds unity, the value of p is taken as 0. 1 

These methods of deriving S and p are applicable over 
a wide field by a simple adaptation of the formulas to the 
particular equations that may be employed in given 
instances. Further illustrations are given in Chapter XVII, 
while this general method is explained in more detail in 
Appendix A. 


The Correlation Ratio 


A third distinctive measure of correlation remains to 
be described. This is the correlation ratio, devised by 
Karl Pearson and represented by the symbol ?? (eta). 
This measure may be looked upon as a special case of p, 
but somewhat different methods are employed in its com- 
putation. 

We have seen that in all cases the degree of relationship 
between two variables, as described by a curve of a given 
type, may be determined from the formula 


Measure of correlation 



The coefficient of correlation, r, is just such a measure, 
when S v represents the standard deviation about a straight 
line. The index of correlation, p, is a general measure of 
the same type. The correlation ratio is precisely the same 
sort of measure, S y in this case representing the standard 
deviation about a line passing through the mean of every 


1 A corresponding correction should be made in the standard error of esti- 
mate, when derived from a small number of observations. In this case the 
correction must raise the unadjusted measure. For this correction Ezekiel 
gives 

5* = S 2 (y— “) 

\;V — mf 

where S represents the corrected standard error of estimate. 
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column in the correlation table. We have, in effect, increased 
the number of constants in the equation of the curve to 
be fitted until the number is equal to the number of columns. 
If the means of all the columns lie on a straight line, the 
correlation ratio and the coefficient of correlation will be 
equal. If the means of the columns do not lie on a straight 
line, the correlation ratio will be greater than the coefficient 
of correlation. 

No new principle is involved, therefore, in the concept of 
the correlation ratio. It is employed when the regression 
is non-linear. It measures the degree of relationship be- 
tween two variables, in so far as this relationship may be 
described by a curve passing through the mean of every 
column. If the relationship is perfect, if there is no scatter 
about the curve fitted in this way, rj will have a value of 1. 
If there is no relationship, if the scatter about the curve 
is as great as the dispersion about the mean of the Y’s, rj 
will have a value of zero. 

The formula generally employed in the computation of 
the correlation ratio differs somewhat from that given above. 
To represent the standard deviation about the line joining 
the means of the columns, the symbol <r av is employed, 
instead of S u . Its meaning is precisely the same as that 
of S y , as employed above, except that a av refers always 
to a correlation table. 

The formula may be written 



When eta is written as above (t? vi ) it refers to the regres- 
sion of Y on X (F dependent). When it is written r) xy it 
refers to the regression of X on Y ( X dependent), and its 
value depends upon the scatter about a line joining the 
means of the rows. Unlike r, which has the same value 
for both regressions, r) yx and rj xv will have different values 
unless the regression be linear. 



THE CORRELATION RATIO 


415 


THE COMPUTATION OF THE CORRELATION RATIO 

Table 104 shows the general relation between the amount 
of nitrogen, in pounds per acre, used as fertilizer in certain 
agricultural experiments, and the corresponding yield of 
wheat, in bushels per acre. 1 The points are plotted in 
Fig. 83. 

Table 104 


Correlation Table Showing the Relation between Wheat Yield 'per Acre 
and Amount of Nitrogen Used as Fertilizer 




X — Nitrogen applied in pounds per acre 



0- 

19.9 

20- 

39.9 

40- 

59.9 

60- 

79.9 

80- 

99.9 

100- 

119.9 

120- 

139.9 

MO- 
159. 9 

160- 

179.9 

Total 

Mean 

of 

Rows 


32- 

35.9 




5 

16 

12 

4 

5 

2 

44 

107.27 

Y — Wheat yield in bushels per acre 

28- 

31.9 



1 

20 

21 

8 

4 

1 


55 

88.91 

24- 

27.9 

20- 

23.9 



16 

19 






35 

60.86 



13 







13 

50.0 

16- 

19.9 


12 








12 

30.0 

12- 

15.9 


8 j 








8 

30.0 

8- 

11.9 

3 

5 








8 

22.50 

4- 

7.9 

10 









10 

10.0 

0- 

3.9 

8 









8 

10.0 

Total 

21 

25 

30 

44 

37 

20 

8 

6 

2 

193 

: 

Mean 

of 

Columns 


15.12 

24.4 

28.73 

31.73 

32.4 


33.33 

34.0 




1 This table is based upon experiments described by E. Davenport (“ Com- 
parative Agriculture" in Bailey’s Cyclopedia of American Agriculture). The 
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For the computation of rj yx by the formula given above 
we need the values of <r„ and <r ay , the latter being the 
root-mean-square deviation about the line joining the means 
of the various columns. The former value may be obtained 
readily by methods already familiar. It is possible to 
compute the quantity a ay by the method first employed 



Nitrogen Applied in Pounds per Acre 


Fig. 83. — Scatter Diagram Showing the Relation between Wheat Yield 
and Nitrogen applied as Fertilizer, with Straight Line of Regression and 
Line joining the Means of the Columns 

in calculating S y , that is, by measuring and squaring the 
deviations of the individual points from the line of regres- 
sion. In the present case, however, the line describing 
the relationship passes through the mean of each column, 
hence these means may be used in place of the “normal” 
values as computed from an equation of regression. In 
computing <r ay , therefore, the deviations of the individual 

actual figures used have been arbitrarily chosen for the purpose of the present 
illustration, but Davenport’s experiments have demonstrated the existence 
of a law similar to the one here assumed. 
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items from the means of the various columns are squared 
added, the mean determined and the square root extracted, 
just as in the computation of the standard deviation. 
Part of the procedure is illustrated in Table 105, using the 
data in the first column of Table 104. This column contains 
all items having X-values between 0 and 20. The mean 
Y - value of the 21 items falling in this column is 5.05; 
deviations are measured from this value. 

Table 105 


Computation of the Squares of the Deviations about the Mean of an 


Class-interval 
( wheat yield in 
bn. per acre) 

m 

/ 

Array 

Deviation from 
mean of column 
(5.05) 
d 

d 2 

fd* 

8-11 9 

10 

3 

4.95 

24.5025 

73.5075 

4-7 9 

6 

10 

.95 

.9025 

9.0250 

0- 3.9 

Total 

2 

8 

- 3.05 

9.3025 

74.4200 

156.9525 


The sum of the squared deviations is obtained for each 
of the other columns in a similar fashion. The standard 
deviation about the means of all the columns, <r av , is found 
to have a value of 2 . 420. The value of <x y is 9 . 188. 

Substituting the given values in the formula 

(j 2 

„ 2 = 1 °VL 
lux 1 2 

„ 2 — ! — ( 2- 42 ) 2 
Vuz (9. 188) 2 

= 1 - .0694 
= .9306 
rjyx = .965. 

This is the value of the correlation ratio, measuring the 
degree of scatter about a line running through the means 
of the columns. Its significance is discussed below. 

The method of calculation employed in the preceding 
example may be materially shortened. Let <r mv represent 
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the standard deviation of the means of the various columns 
about the arithmetic mean of all the F’s. In computing 
this value the mean of each column is weighted by the 
number of items in that column. It may be shown 1 that 


1 The following proof is adapted from Yule. 

Given a series with mean M made up of two component series with means Mi 
and Mi. N , the total number of observations, is equal to Ni + AT 2 , the sum of 
the observations in the two component series. What is the relation between a, 
ax and <T2? If we let Aft — M = Ci 

then for S\ 2 } the mean-square deviation of the observations in the first of the 
two component series, measured from M as origin, we have 

Si 2 — <Ti 2 + Cl 2 . 

Similarly S2 2 — a 2 2 + c 2 2 . 

But N1S1 2 is equal to the sum of the squares of the deviations, about Af, of 
the items in the first of the component series, and N2S2 2 is equal to the sum 

of the squares of the deviations, about A/, of the items in the second of the 

two component series. Therefore 

„ AW + AW 
a ~ N 

and N * 2 = AW + AW. (1) 

But Si 2 — ai 2 -f* C1 2 and S2 2 = <j % 2 + c 2 2 

therefore Na 2 = A r i(<ri 2 + Ct 2 ) 4 N% (a 2 2 -b c 2 2 ). (2) 

In the present case we have the major series with mean represented by M yi 

and a number of component series (the items arranged by columns) with means 
represented by m y 1, etc. Let S tt y represent the standard deviation of any column 
of F’s about the mean of that column. Then we have a number of component 
series, with standard deviations S ay 1, etc., and with means differing from the 
mean of all the F’s by M y — m y 1, etc. Substituting in equation (2), we have 
N<t v 2 = ntlS^ 2 + {M v - my,) 2 ] + ri2[S a y 2 4 (M v - m U2 ) 2 ] + . . . (3) 

Nay 2 = Xn[Say 2 + (M y - m v ) 2 ]. (4) 

But Naay 1 = S(n • Say 2 ) 


for, in each column, 


S a y 2 — 


2d 2 

n 


since d represents a deviation from the mean of that column. For all columns, 

2 d 2 2 (n • Say 2 ) 

<r ° y2 ~ N ~ N 


Substituting in equation (4) 

Nay 2 = Na ay 2 4 2 n(My - m v ) 2 . (5) 

By definition of the standard deviation of the means of the columns 

2n(Afy — m v ) 2 

= at 


ay 2 — a ay 2 4 <Tmy 2 * 


Therefore, from (5), 


( 6 ) 
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O’ny 2 — Gy 2 &viy~» 

Substituting for or atf 2 in the equation 



we secure 



Since a my may be much more easily determined than a av 
the value of rj is generally computed from this formula. 
The data of Table .104 may be used to exemplify the process. 
Calculations appear in Table 106. 


Table 106 

Illustrating the Computation of the Correlation Ratio 


Type of array 
( X-value of 
items in 

Mean value 
of Y -items 

Deviation 
from mean 

Square of 

Fre- 


in column 

of all Y's 

deviation 

quency 


column) 

(pounds) 

(bushels) 

m y 

(25.005) 

d 

d* 

f 

id* 

10 

5.05 

- 19.955 

398.202 

21 

8,362.242 

30 

15.12 

- 9.885 

97.713 

25 

2,442.825 

50 

24.40 

- .605 

.366 

30 

10.980 

70 

28.73 

+ 3.725 

13.876 

44 

610.544 

90 

31.73 

+ 6.725 

45.226 

37 

1,673.362 

110 

32.40 

+ 7.395 

54.686 

20 

1,093.720 

130 

32.00 

+ 6.995 

48.930 

8 

391.440 

150 

33.33 

+ 8.325 

69.306 

6 

415.836 

170 

34.00 

+ 8.995 

80.910 

2 

161.820 

Total 




193 

15,162.769 


Gmy — 



15,162.769 

193 


= 8.864. 
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Substituting the given values in the formula 



we have 

_ 8.864 
Vvx 9.188 
= .965. 

The process of computing the correlation ratio may be 
briefly summarized: 

1. Arrange the items in the form of a correlation table. 

2. Find the arithmetic mean of all the 7-items in each column 

(i.e., find the arithmetic mean of each F-array of type X ). 

3. Compute the arithmetic mean of all the F’s. 

4. Measure the deviation of the mean of each column from the 

mean of all the F’s. Square each of these deviations and 
multiply by the number of items in the given column. Get the 
sum of the squared deviations. 

5. Divide this sum by the total number of items and extract the 

square root of the result. This gives the value of <r mv . 

6. Compute a v . 

7. Divide cr mv by <r„. The quotient is rj yz . 

The value of the correlation ratio of X on Y may be 
similarly computed, substituting the proper values in the 
formula 



The symbol <r mx represents the standard deviation of the 
means of the various rows about the mean of all the A’s. 
The value of the correlation ratio of X on Y depends upon 
the amount of scatter (horizontally) about the line joining 
the means of the rows. Its value will generally be different 
from that of the correlation ratio of Y on X. In the present 
case the value of is found to be .824. As the line of 
relationship approaches the linear form the two correlation 
ratios approach identity. 
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Like r, 77 can never exceed 1, this value being secured 
when there is no dispersion about the line joining the 
means of the columns (or rows). From the formula 



it is evident that the value of the correlation ratio is zero 
when <r mv is zero. This is the case when the mean of each 
column has the same value as the mean of all the F’s. 
Such a condition is found when an increase or decrease in 
the value of the X-variable brings no corresponding change 
in the value of the F-variable. This means that in each 
column of the correlation table there is a distribution of 
cases similar to the general distribution of F’s. When 
this is true there is clearly no relation between the two 
variables. 

The correlation ratio, it should be noted, never has a 
negative value. It is possible to determine by inspection 
of the correlation table, however, whether the relation 
between two variables is direct, or inverse, or a varying one. 

The coefficient of correlation has one distinct advantage, 
as compared with the correlation ratio, in that when its 
value and the values of the two standard deviations are 
known the equations to the lines of regression may be 
readily determined. This is not true of 77. To get a quantita- 
tive expression for the “law” of relationship between two 
variables, when 77 has been computed, an additional calcula- 
tion for the purpose of fitting a curve to the means of the 
arrays would be necessary. 

CORRECTION OP THE CORRELATION RATIO 

The use of 77 is only possible when the data are numerous, 
and can be arranged in the form of a correlation table. If 
a limited number of items should be so arranged, and it 
chanced that there was but one item in each column, the 
two measures <r my and <r v would be identical and 77 would 



422 NON-LINEAR CORRELATION 


necessarily have a value of 1 . Computed from a very small 
number of cases and employing a large number of classes, 
the correlation ratio would be meaningless. 

The raw correlation ratio may be corrected by the method 
employed on a preceding page for the index of correlation, 
with m set equal to the number of groups (i.e., to the num- 
ber of columns, for rj vx ) to the number of rows for rjxy). 
Thus, if rj be the corrected value, we have 

? - 1 - {<■ " ' 

In the present instance 

- 1 - {(1 - 9306)(j|^!)} 

= .9276 
V = .963. 

The correction is very slight in the present case, but if 
N were small or m very large it would reduce the given 
value materially. 

RELATION BETWEEN THE CORRELATION RATIO -AND THE 
COEFFICIENT OF CORRELATION 

When the relation between two variables is absolutely 
linear the line running through the means of the columns 
corresponds, of course, to the line upon which the coefficient 
of correlation is based. When this is the case rj and r have 
the same value. As the relationship between the two 
variables departs from the linear form the values secured 
for r) and r differ, 77 being always greater than r. This 
results from the fact that the scatter about a line joining 
the means of the columns will always be less than the 
scatter about a straight line fitted to these points, except 
when the straight line passes through every mean point. 
And the less the scatter about the line expressing the 
average relationship the greater the value of the measure 
of correlation. Thus for the alfalfa problem it was found 
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that r has a value of + • 69, and that an index of correlation 
based upon a second degree parabola has a value of .80. 
The correlation ratio for the same material is .82. . For 
the data of Table 104 the value of q ux (uncorrected) was 
found to be .965; the value of r is + .793, the difference 
between the two being marked. The reason for the difference 
is found in Fig. 83, in which the straight line of regression 
of Y on X and the line joining the means of the columns 
are shown. The regression departs materially from linearity, 
and the scatter about the straight line of regression is much 
greater than the scatter about the line joining the means. 

The relation between r and q affords a convenient test 
of linearity in a given instance, since the two values will 
be identical when the regression is strictly linear, and will 
differ the more widely the greater the departure from the 
linear form. The general test for linearity is 

f = V 2 ~ 7-2 ■ 

Even in a case of linear regression it is probable that q 
and r will differ somewhat because of fluctuations due to 
chance alone. A material difference, as reflected in the 
magnitude f (zeta), indicates that a straight line does not 
describe the relationship in question and that r is not a 
suitable measure of correlation. In the example given 
above, in which q equals .965 and r equals .793, the 
measure f has a value of . 302. (The uncorrected q is used 
in this test.) This is large enough to indicate that the regres- 
sion is non-linear. 

In later sections methods of testing for linearity are 
more fully discussed. 
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CHAPTER XIII 


ELEMENTARY PROBABILITIES AND THE 
NORMAL CURVE OF ERROR 

Reference has been made in an earlier section to the 
family resemblance which is found among frequency distri- 
butions drawn from widely different fields. Attention was 
also drawn to a certain basic type, represented graphi- 
cally by the symmetrical bell-shaped curve, which is called 
the “normal curve,” or the “normal curve of" error.” In 
an earlier day this curve was looked upon as representing 
a fundamental law which described all distributions of 
quantitative data. From the modern standpoint this was 
quite an erroneous conception. The normal curve is viewed 
today as but one of a number of types of curves which 
may be used to describe frequency distributions. It is, 
however, by far the most important type. For many of 
the measurements used to describe distributions of observa- 
tions (measurements such as the mean, the standard devia- 
tion, the coefficient of variation) are distributed in accord- 
ance with this normal law of error. The procedures employed 
in generalizing results obtained from the study of samples 
and, in particular, in determining the reliability of such 
generalizations, lean heavily upon this law. An under- 
standing of the characteristics of the normal curve is 
essential to the statistician. 

Elementary Theorems in Probability 

We may approach this subject by a brief consideration 
of certain elementary principles of probability that enter 
into many forms of statistical work. A detailed explanation 
of the theory of probability would carry us beyond the 

425 
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limits of the present volume. The treatment which follows 
is presented only as an introduction to the subject, designed 
to illustrate, by simple numerical examples, the relation 
between the principles of probability and the normal law 
of error. 

In this argument we may use the following standard 
notation. If an event can occur in n ways, a of which are 
to be considered as successful and b as unsuccessful, the 
probability p of a successful outcome may be written 

a 

V — - 

n 

and the probability q of an unsuccessful outcome may be 
written 



Since the sum of the favorable and unfavorable outcomes 
is equal to the total number of events, we have 


Dividing by n, 


a + b = n. 


-+ - = i 
n n 


so that 

V + Q = 1 

or certainty. 

A probability, therefore, may be written as a ratio. The 
numerator of the fraction corresponding to this ratio repre- 
sents the number of favorable (or unfavorable) outcomes, 
while the denominator represents the total number of 
possible outcomes. 


EXAMPLES OP SIMPLE PROBABILITIES 

If a coin be tossed, the turning up of a head being looked 
upon as a favorable outcome, we have, as the probability 
of a success, 

1 
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and of a failure, 


Q 


1 

2 


If we roll a die, regarding a six spot as a favorable outcome, 

1 

P = 6 


and 


5 

9 = 6 

If a card be drawn from a pack of 52 the chance of drawing 
the ace of spades is -h, of failing in that endeavor, 


THE ADDITION OF PROBABILITIES 


What is the chance of securing either an ace of spades 
or a two of spades in a single draw from a pack of 52 cards? 
In such a case, where any one of several outcomes will be 
considered as favorable, the probability of a success is 
the sum of the separate probabilities. In this example 


V = 


1 + 1 = 
52 ' 52 


26' 


The chance of drawing either a heart or a spade from a 
pack of playing cards is given by 

13 , 13 1 

V 52 + 52 2 


THE MULTIPLICATION OF PROBABILITIES 

Two events are said to be independent when the outcome 
of one does not affect the outcome of the other. Thus the 
result of one throw of a die does not, presumably, affect 
the result of the next toss. The probability of a compound 
event (i.e., that two events, independent of one another, 
will both occur) is the product of the probabilities of the 
separate events. Thus the chance of securing an ace, 
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followed by a two spot, in two successive throws of a die, 
is given by 

1^1 1 
v 6 X 6 36 

In computing the probability of a given outcome it is 
frequently necessary both to multiply and to add probabili- 
ties. For example, we wish to determine the chance of 
securing the total 5 from two dice thrown simultaneously. 
We may label the dice a and b to distinguish them. This 
total may be secured from any one of the four following 
combinations: 


Die a Die b 

1 4 

2 3 

3 2 

4 1 

The chance of securing an ace with die a is &, of secur- 
ing a 4 with die b is i. The chance of the two in combi- 
nation is gV. Similarly, the probability of each of the other 
three combinations is -h- But any one of these four re- 
sults will give a total of 5, and will be considered success- 
ful. Hence 


I I A . X J. 

P- 36 + 36" h 36 + 36~9' 

We have in this example answered the question: What 
is the probability of securing exactly 5 in the toss of two 
dice? We might put the question: What is the chance of 
securing at least 5 in the toss of two dice? In this case a 
total of 5 or more will be considered a favorable outcome. 
Just as in the preceding example, we may work out the 
probability of securing each of the results which will be 
accepted as successful. The following summary indicates 
the probability of each of these totals: 
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Probability of throwing 12 

with two dice 

l 

36 

if 

u a 

ii 

a 

tt 

tt 

2 

36 

a 

u it 

10 

tt 

tt 

tt 

3 

36 

u 

a a 

9 

a 

tt 

tt 

*18: 

II 

t: 

a a 

8 

tt 

tt 

tt 

II 

81 

u 

a tt 

7 

tt 

tt 

tt 

ii 

81® 

u 

a a 

6 

tt 

tt 

tt 

**18 

ii 

u 

tt a 

5 

tt 

tt 

tt 

*18 | 
ii 


Sum of above probabilities 

30 

36 


The chance of throwing at least 5 in the toss of two dice is, 
therefore, M or f. 

The Binomial Expansion and the Measurement of 

Probabilities 

It is possible to express these facts in a generalized form. 
A simple illustration may be employed to exemplify the 
derivation of the general expression. 

If two coins are tossed simultaneously there are four 
possible outcomes 

a b a b a b a b 

TT TH HT HU. 

(The two coins are represented, respectively, by the letters 
a and b.) The chances of securing no heads, one head, and 
two heads are, respectively, 1, £, and If three coins 
(represented by the letters, a, b, and c) are tossed simul- 
taneously, we have eight possible outcomes 

a b c a b c a b c a b c a b c a b c a b c a b c 

TTT TTH THH THT HTT HTH HHT HHH. 
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The chances of securing no heads, 1 head, 2 heads, and 
3 heads are, respectively, i, I, f, &. 

But these results may be derived without working out 
the separate probabilities in detail. We have employed 
p and q to represent, respectively, the probability of success 
and failure of a given event. If there are two independent 
events the compound probabilities are given by the expansion 
of the expression 


(p + q) 2 - 


For the case in which p (e.g., the probability of throwing 
a head) = q = f, the probabilities of the various results 
are given by 



4 ^ 2 ^ 4 


These are the results secured in the first example cited 
in this section. If there are three independent events, 
with p = q »= §, we have 



= ! , ? , 3 1 

8 "a "o’ 


the probabilities secured in the second example. 

If we wish to know not the separate probabilities but the 
probable frequencies of the various outcomes in a given 
number of trials, these may be computed from the expression 


N(p + q)” 


where N represents the number of trials and n the number 
of independent events. Thus if there are 200 trials and 
there are two independent events, the probable frequencies 
are given by 

200 (p + q) 2 = 200 (p 2 + 2 pq + q 2 ). 

With p = q = \ this gives us 

20o(|) + 200^0 + 20oQ = 50 -f 100 + 50 
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which indicates the probable frequencies of 2 successes, 
1 success, and no successes. 

If there are three independent events, the probable 
frequencies in N trials are determined from the binomial 
expansion of 

N(p + g) 8 . 

If N equals 200, we have 

200 (p 3 + 3 p 2 q + 3pg 2 + g 8 ). 

If p equals £, we have 

2 O0(i) + 20 0(?) + 2° o (?) + 2O0(i) 

These terms indicate, in order, the probable frequencies 
of 3 successes, 2 successes, 1 success, and no successes. 
The total frequencies secured by carrying through the 
process of multiplication will be equal to the number of 
trials, for all possible outcomes are covered by the expansion. 

Thus, when we know in advance 1 the probabilities attach- 
ing to similar but independent events, we may determine the 
probable frequencies of any given number of successes or 
failures. This is true whether p and q be equal or unequal. 
It is necessary only that p and q remain constant. There 
is here a fact of great significance in the development of 
statistical theory. 

A COMPARISON OF ACTUAL AND THEORETICAL FREQUENCIES 
IN THE REALM OF PURE CHANCE 

Certain points of importance may be made clear by 
comparing some experimental results with the theoretical 
frequencies given by the binomial expansion. Twelve dice 

1 A distinction is generally drawn between a priori probabilities of the type 
described above, and empirical probabilities, knowledge of which is derived 
from observation or experience. As an example of the latter type we have, 

74 173 

as the probability that a man aged 35 will live 10 years, the ratio This 

is based upon the American Experience Table of Mortality which shows that 
of 81,822 men living at the age of 35, there are 74,173 living ten years later. 
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were thrown a number of times. Each 4, 5, or 6 spot 
appearing was considered to be a success, while a 1, 2, 
or 3 spot was a failure. (In a typical throw we might have 
the following spots up: 3, 1, 5, 1, 2, 4, 4, 6, 3, 2, 3, 5. In 
this lot there are five successes, and the result is so tallied.) 
In a classical example recorded by W. F. R. Weldon 1 
twelve dice were thrown in this way 4,096 times, a success 
being defined as above. The results are recorded in column 
(2) of Table 107, and the distribution is shown in Fig. 84. 
By computation we find the arithmetic mean and the 
standard deviation of this distribution to be, respectively, 
6.139 and 1.712. 

Let us compare with these results those which we might 
expect from the given conditions. Twelve dice were thrown 
each time, hence we are dealing with 12 independent 
events. There were 4,096 trials. Since either a 4, 5, or 6 is 
considered a success, p = q = i . 

For the terms in the binomial expansion we have 


(p + q) n = P n + np n ~ l q + - ~~ P" - V + * 9 * 

+ . . . + q n . 

In the present case we have 

4 -° 96 (§ + 0 ”' 


Expanding 

HA ' 12 


+ 


66 220 495 792 924 

i a r\r\c* i a nr\a ' a f\c\ct ' a f\c\a * 


096 4,096 ^ 4,096 4,096 ' 4,096 1 4,096 1 4,096 

J_Y 

,096/ 


, 792 495 220 66 12 

' a nc\a ' a nnc * a nna * a nnc ' a nnc ' 


4,096 ' 4,096 ' 4,096 1 4,096 1 4,096 1 4,1 


Completing the indicated multiplication we have the theo- 
retical frequencies of the various possible successes in 
4,096 throws of twelve dice. These are shown in column 
(3) of Table 107. 


Cited by F. Y. Edgeworth, Encycl. Brit., 11th ed., VoL XXII, 394. 
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Table 107 

Comparison of Actual and Theoretical Frequencies in Dice-Rolling 

Experiment 


( 1 ) 

( 2 ) 

(3). 

. Number of 

Observed 

Theoretical 

successes 

frequencies 

frequencies 

0 

0 

1 

1 

7 

12 

2 

60 

66 

3 

198 

220 

4 

430 

495 

5 

731 

792 

6 

948 

924 

7 

847 

792 

8 

536 

495 

9 

257 

220 

10 

71 

66 

11 

11 

12 

12 

0 

1 


4,096 

4,096 


The distribution of the theoretical frequencies is shown 
in Fig. 84, with that of the observed frequencies. The 
relationship of the two distributions is close. 

When we have, as in this case, a knowledge of the 
probabilities involved, it is possible to determine the arith- 
metic mean and the standard deviation of the distribution 
of the theoretical frequencies. As a general expression for 
the mean number of successes, where the number of inde- 
pendent events and the probability of success are known, 
we have 

M = np. 

Applying the present values, 

M = 12 X \ = 6. 

The mean, as computed from the observed frequencies, is 
6.139. 
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As a general expression for the standard deviation, 1 under 
the same conditions, we have 

a = Vnpq. 

In the present case 

= y 12 X ^ X 2 = V3 



Number of Successes 

Fig. 84. — A Comparison of Actual and Theoretical Frequencies in a 
Dice-Rolling Experiment 

The standard deviation, as computed from the actual 
frequencies, is 1.712. 

When proportions, or relative frequencies, are dealt with, 
the standard deviation (a') may be derived from the relation 



x This formula for the standard deviation of a binomial distribution is of 
central importance. The derivation of this formula, and that for the mean of 
a binomial distribution, are given in Appendix B. 
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The Normal Curve of Error 


We may return to a consideration of the curve in Fig. 84 
which represents the theoretical frequencies in the dice- 
throwing experiments. It is a perfectly symmetrical 12-sided 
polygon, the number of sides (excluding the base) corre- 
sponding to the number of independent events in the particu- 
lar problem considered. With six events we should have a 
six-sided figure, with twenty events a twenty-sided figure, 
and so on. It is obvious that, as n increases, the number 
of sides to the polygon increasing correspondingly in num- 
ber, the graph representing the expansion of the binomial 
(p + q) n approaches more and more closely a smooth curve. 
With n infinitely large a perfectly smooth curve would be 
secured. This is the normal curve of error which has been 
plotted in Fig. 85. 

The equation to this curve is written in several forms, of 
which 

x* 

2<r* 

y = y 0 e 

is one. In this equation y„, the maximum ordinate, is a 
constant; e is a constant (the base of the Napierian loga- 
rithms) having a value of 2.71828; a represents the stand- 
ard deviation; and x is a given value of the dependent 
variable expressed as a deviation from the mean. The maxi- 
mum ordinate may be derived from the relation 
N 

y ° crVZir 


hence the equation to the normal curve may be written 


y = 


N - 
<r \Z2ir 6 


x' 


where r is the constant 3 . 14159. 

This equation may be derived in several ways. 1 One 

1 Gauss’ deduction of the error equation may be found in all standard works 
on the theory of least squares. Cf. references at end of Appendix A. 
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procedure which throws light on the physical conditions 
giving rise to the emergence of a normal distribution, starts 
from three basic assumptions. 

1. The causal forces affecting individual events are 
numerous, and of approximately equal weight. 

2. The causal forces affecting individual events are 
independent of one another. 

3. The operation of the causal forces is such that devia- 
tions above the mean of the combined results are balanced 
as to magnitude and number by deviations below the mean. 

A great part of the power which modern statistical 
technique possesses is derived from the detailed knowledge 
of the characteristics of the normal or Gaussian curve. 
From prepared tables showing the fractional parts of the 
total area under the curve lying between ordinates erected 
at stated distances from the maximum ordinate, theoret- 
ical frequencies may be determined much more readily 
than by the laborious method based upon the binomial 
expansion. 

USE OF A TABLE OF AREAS UNDER THE NORMAL CURVE 

The entire area under a frequency curve is taken to 
represent the total number of frequencies. Given information 
as to the proportion of the total area within a given segment, 
it would be easy to compute the frequencies represented 
by this segment, or to determine the probability that a 
given observation from the population represented by the 
curve would fall within the limits of this segment. Prepared 
tables of the probability integral, of which Table 108 is an 
example, serve just this purpose, with respect to the normal 
curve. (A more detailed table than that here given is 
needed for accurate computation. Appendix Table I will 
serve most purposes. 1 ) 

1 Tables of areas under the normal curve, as calculated by Dr. W. F. Shep- 
pard, are available in many publications. Cf. Tables for Statisticians and Bio- 
metricians, edited by Karl Pearson, Biometric Laboratory, University College, 
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Table 108 

Area of the Normal Curve in Terms of Abscissa 
(Giving fractional parts of the total area between y Q and ordinates erected 
at varying distances from y Q ) 


x/<r 

a 

x/a 

a 

0.0 

.00000 

2.0 

.47725 

0.1 

.03983 

2.1 

.48214 

0.2 

.07926 

2.2 

.48610 

0.3 

.11791 

2.3 

.48928 

0.4 

. 15542 

2.4 

.49180 

0.5 

. 19146 

2.5 

.49379 



2.5758 

.49500 

0.6 

.22575 

2.6 

.49534 

0.7 

.25804 

2 7 

.49653 

0.8 

.28814 

2.8 

.49744 

0.9 

.31594 

2.9 

.49813 

1.0 

.34134 

3.0 

.49865 

1.1 

.36433 

3.1 

.49903 

1 2 

.38493 

3 2 

.49931 

1.3 

.40320 

3 3 

.49952 

1.4 

.41924 

3.4 

.49966 

1.5 

.43319 

3.5 

.49977 

1.6 

.44520 

3.6 

.49984 

1.7 

.45543 

3.7 

.49989 

1.8 

.46407 

3.8 

.49993 

1.9 

.47128 

3.9 

.49995 

1.96 

.47500 

4.0 

.49997 


Since the normal curve is symmetrical about the maxi- 
mum ordinate, the values given in Table 108 apply to 
observations on both sides of the mean. In using such a 
table, deviations from the mean are first expressed in units 
of the standard deviation. (The term normal deviate is 
applied to such a quantity, that is, to a deviation from the 
mean of a normal distribution expressed in units of the 
standard deviation of that distribution.) The proportion 

London; Tables of Applied Mathematics, J. W. Glover, Ann Arbor, Michigan, 
George Wahr; Manual of Problems and Tables in Statistics, F. C. Mills and 
D. II. Davenport, New York, Henry Holt and Co. 
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of the total area lying between any two ordinates may then 
be readily determined. For example: What proportion of 
the cases in a normal distribution lies between the maximum 
ordinate and an ordinate erected at a distance from the 
mean equal to + <r? Reading down the x/a column to 1.0, 
we find the value .34134 opposite it. This, in ratio form, 
is the proportion of cases falling within the limits indicated. 



Fig. 85. — An Illustration of the Measurement of Areas under the Normal 

Curve 

Expressing this ratio as a percentage, we have 34 . 134 per 
cent as the answer to our question. 

Fig. 85 shows the relation of this area (the shaded area A) 
to the total area under the curve. 

What proportion of the total number of cases in a normal 
frequency distribution will fall between an ordinate erected 
at a distance from the mean equal to — 1.4<r and one 
erected at - 2a? From the table we find that 41.924 per 
cent of the total area will lie between y 0 and the ordinate 
at — 1 . 4<r ; 47.725 per cent will lie between y„ and the 
ordinate at — 2a. The difference, 5.801 per cent, will fall 
between the ordinates at — 1 . 4<r and at — 2a. This may 
be converted into actual frequencies by taking this propor- 
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tion of the total number of cases in the given distribution. 
The shaded segment B in Fig. 85 represents the area thus 
marked off. 

For certain purposes we wish to know the proportion 
of the total number of cases deviating by a stated amount 
or more in either direction from the mean of a normal 
distribution. If we wish to know the proportion of all 
cases deviating from the mean by 1 . 96<r or more, we must 
add to the area between + 1.9 Gcr and the upper limit of 
the curve the area between — 1 . 96<r and the lower limit of 
the curve. Each of these areas equals .50000 — .47500, 
or . 025. The percentage of cases deviating from the mean 
by + 1.96<r or more is 2.5; the percentage deviating by 
— 1.96<r or more is 2.5. The percentage deviating above 
or below the mean by 1 . 96<r or more is 5.0. Similarly, 
it may be determined from the entries in Table 108 that 
just one per cent of all the cases in a normal distribution 
will deviate from the mean, positively or negatively, by 
2.5758<r, or more. This “one per cent” area is represented 
by the sum of the shaded portions at the two tails of Fig. 85. 
The ordinates defining the inside limits of these segments are 
erected at + 2 . 5758a and at — 2 . 5758 a, while the outer 
limits are at infinity. 

Special significance attaches to the two limits last men- 
tioned, because of the uses made of them in interpreting 
errors of sampling. This topic is developed at a later point. 
Here we may note that the figures defining proportions of 
the total area under the normal curve falling in given 
areas may also be interpreted as probabilities. The proba- 
bility that a given observation, made at random in a popu- 
lation distributed according to the normal law of error, 
will fall between the mean and a value one standard devia- 
tion above the mean is .34134; the probability that a given 
observation will deviate from the mean by 1.96<r or more 
is .05; the probability that a given observation will deviate 
from the mean by 2.5758<r or more is .01. 
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The method by which probabilities of occurrence may 
be determined from a table of areas under the normal 
curve, and by which the significance of a given normal 
deviate may be established, should be clearly understood. 
These methods enter in many ways into the work of a 
statistician. 

The uses of the normal curve of error, and of the table 
of areas based upon the integration of this curve, are too 
varied to be enumerated at length here. A simple example 
may serve to introduce the subject. 

An Economic Application 

The statistical division of the American Telephone and 
Telegraph Company has made a study of the annual 
message use of four-party line residence message rate sub- 
scribers in Buffalo. The annual messages for each of 995 
subscribers were tabulated and classified. 1 The results, 
together with certain computations, appear in Table 109. 

THE MOMENTS OF A FREQUENCY DISTRIBUTION 

Some ter ms and symbols that have not been employed 
heretofore may be introduced at this point. We may 
write, using v (nu) to define certain quantities of interest 
to us, 

first moment of the distribution about the arbitrary 
origin. 

second moment of the distribution about the arbi- 
trary origin. 

third moment of the distribution about the arbi- 
trary origin. 

fourth moment of the distribution about the arbi- 
trary origin. 

*“ Introduction to Frequency Curves and Averages.” Statistical Bulletin, 
Statistical Methods Series, No. 1. Issued by Chief Statistician, American Tele- 
phone and Telegraph Co. 
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Table 109 

Annual Message Use of 995 Telephone Subscribers 
(Illustrating the computation of the moments of a frequency distribution) 


(1) 

(2) 

(3) 

(4) 

Deviation 

(5) 

(6) 

(7) 

(8) 

Interval 
of message 
use * 

Mid- Fre- 
point quency 

from arbi- 
trary origin 
in class-in- 








terval units 






m 

/ 

x > ' 

/*' 

fix')* 

fix 0* 

fix' V 

0- 50 

25 

0 

- 10 

0 

0 

0 

0 

50- 100 

75 

1 

- 9 

- 9 

81 

- 729 

6,561 

100- 150 

125 

9 

- 8 

- 72 

576 

- 4,608 

36,864 

150- 200 

175 

19 

- 7 

- 133 

931 

- 6,517 

45,619 

200- 250 

225 

38 

- 6 

- 228 

1,368 

- 8,208 

49,248 

250- 300 

275 

50 

- 5 

- 250 

1,250 

- 6,250 

31,250 

300- 350 

325 

95 

- 4 

- 380 

1,520 

- 6,080 

24,320 

350- 400 

375 

85 

- 3 

- 255 

765 

- 2,295 

6,885 

400- 450 

425 

115 

- 2 

- 230 

460 

- 920 

1,840 

450- 500 

475 

132 

- 1 

- ±32 

132 

- 132 

132 

500- 550 

525 

144 

0 

0 

0 

0 

0 

550- GOO 

575 

116 

1 

116 

116 

116 

116 

600- 650 

625 

79 

2 

158 

316 

632 

1,264 

650- 700 

675 

54 

3 

162 

486 

1,458 

4,374 

700- 750 

725 

31 

4 

124 

496 

1,984 

7,936 

750- 800 

775 

11 

5 

55 

275 

1,375 

6,875 

800- 850 

825 

5 

6 

30 

180 

1,080 

6,480 

850- 900 

875 

6 

7 

42 

294 

2,058 

14,406 

900- 950 

925 

2 

8 

16 

128 

1,024 

8,192 

950-1,000 

975 

1 

9 

9 

81 

729 

6,561 

1,000-1,050 

1,025 

1 

10 

10 

100 

1,000 

10,000 

1,050-1,100 

1,075 

1 

11 

11 

121 

1,331 

14,641 



995 


- 956 

9,676 

- 22,952 

283,564 


“Moment” is a familiar mechanical term for the measure 
of a force with respect to its tendency to produce rotation. 
The strength of this tendency depends, obviously, upon the 
amount of the force and the distance of the point at which 
the force is exerted from the origin. The term is used in sta- 


* As here classified an item having a value of 50 was put in the class having 
50 as an upper limit. Items falling on other class limits were similarly disposed 
of. 
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tistics in a quite analogous sense, the class-frequencies being 
looked upon as the forces in question. The size of each class- 
frequency and the distance of each class midpoint from the 
origin are the factors of prime importance in this respect. 
The moments of a distribution about any origin may be 
computed by multiplying the frequency of each class by 
a given power of its distance, along the x-axis, from the 
origin, s umming the resulting products and dividing by 
the number of cases. If the first moment is desired, the 
first power of the x-distance is employed; if the fourth 
moment, the fourth power of the x-distance, etc. The 
subscripts indicate the moments represented by the various 
symbols. 

The most significant moments, for statistical purposes, 
are those which relate to the arithmetic mean as origin. 
Representing these moments by ir (pi) 1 we have the 
following relationships : 

First moment about the mean = tti — 0. 

Second “ “ “ “ =T t = v*~ Vl \ 

Third “ “ “ “ = ir 3 = v* — 3vip 2 + W- 

Fourth “ “ “ “ = 7 t 4 = -f Gvfa.— 3»<i 4 . 

The computation of these moments from the data, as 
classified, involves the assumption that the items in each 
class can be treated as though they were concentrated at 
the mi dpoint of that class. It has been established that, 
under certain conditions, calculations made on this assump- 
tion are subject to a constant error. In particular, it has 
been shown that the values of the second and fourth 
moments are not the same, when computed from grouped 
data, as when computed from ungrouped data. 

W. F. Sheppard 2 has worked out certain corrections for 
this bias. His corrections may be applied when two 
conditions prevail : 

1 In the equation to the normal curve v represents the familiar constant, 
3. 14159. As a symbol for a moment about the mean it relates, of course, to 
no such constant value. 

2 Cf. Proceedings of the London Mathematical Society , Vol. XXIX, 353-380. 
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(1) When the distribution relates to a continuous variable. 

(2) When the frequency curve is characterized by “high 
contact, ” i.e., when the frequency curve tapers off gradually 
in both directions. 

The symbol ju (mu) is employed to represent a corrected 
moment about the mean. The application of Sheppard’s 
corrections gives us the following final formulation: 

Mi = 0 

1 

fli — Tt j2 

M3 = TT» 

1 , _ 7 _ 

M4 ** 2 7T 2 + 240' 

(In applying the corrections fa and the correspond- 
ing decimal values, .083333 and .029167, will generally 
be employed.) It is assumed in making these corrections 
that a class-interval unit has been employed in measuring 
deviations from the mean. 

It may be noted in passing that the standard deviation is 
the square root of the second moment about the mean. For 
the uncorrected value, 

a = Virz. 

If Sheppard’s corrections 1 are to be applied 

a = Vn 2 . 

The calculation of the moments of the frequency distribu- 
tion of telephone subscribers is shown on page 444. Shep- 
pard’s corrections are applied, since the curve is marked by 
reasonably high contact. It is a discontinuous distribution, 
but the uni t (1) is so small in comparison with the range 
that it may be treated as continuous. 

1 It should be noted that these corrections, when appropriate, are applicable 
to the standard deviations entering into the calculation of the coefficient of 
correlation. 



444 THE NORMAL CURVE OF ERROR 


Vl = — jg§ = - .960804 


Vi = 


995 
9,676 


995 


= 9.724623 


- 22,952 = _ 23.067337 
3 995 

Vi = 28 ^ 64 - = 284.988945 


7Ti = 0 

n = Vi- vS = 9.724623 - .923144 = 8.801479 
Xl . v 3 - 3w + 2vi* = - 23.067337 + 28.030370 

- 1.773922 = 3.189111 

7T4 = V4 — 4v t V3 -(- Ol'i V 2 3vi 4 

= 284.988945 - 88.652760 + 53.863384 - 2.556586 
= 247.642983 


M i = 0 

^ = X2 _ 1 = 8.801479 - .083333 = 8.718146 
1 

/X3 = 7T3 = 3.189111 

/i4 = X4 - + gio = 247.642983 - 4.400739 + .029167 

= 243.271411 


CRITERIA OF CURVE TYPE 

Having these values, we may return to a consideration 
of the main problem, the utilization of our knowledge of 
the normal curve. There are certain criteria, represented 
by the letters /3 (beta) and k (kappa), which enable us 
to determine readily whether a given distribution may be 
described by a curve of the normal type. These may be 
derived from the corrected moments of the given distribution. 


ft - M3 - 
Pi ; — 

P2 3 


10.170429 

662.632015 


= .01534853 


JU4 243.271411 
^ 2 ~ H 2 2 76.006070 


= 3.200683 


ft(ft + 3) 2 

4(4/3 2 - 3ft) (2ft - 3/3! - 6) 


Ki - 
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. 01534853 X 38 . 448470 . 5901275 

Ko = = 

4(12. 756686) (.355320) 18.130823 

k 2 = .032548 

For the normal curve these criteria have the following 
values: 

ft = 0 
ft = 3 
*2 = 0 

We may conclude, tentatively, that the normal curve 
may be used to describe the given distribution. 1 

Fitting a Normal Curve; Use of a Table of Areas 

The process of fitting a normal curve to a set of observa- 
tions involves the computation of theoretical frequencies 
corresponding to the observed frequencies. This may be 
done from a table of areas under the normal curve (see 
Appendix Table I). Using such a table, in the manner indi- 
cated in the preceding section, the areas between the maxi- 
mum ordinate and ordinates erected at the various class 
limits may be determined. By the simple process of subtrac- 
tion the area within each class, and hence the theoretical 
frequencies, may then be computed. The procedure is illus- 
trated in Table 110 on page 446, relating to the distribution 
of telephone subscribers. 

The theoretical distributions derived from this fitting 
process may be compared with the observed frequencies, 
as given in Table 109. Or the comparison of the actual 
distribution and the fitted curve may be made graphically, 
as in Fig. 86. It is apparent by inspection that the normal 
curve gives a fairly good fit to the data, although there 
are several classes in which the differences are marked. A 
natural question arises as to the reason for the failure of 
the normal curve to fit at all points. There are two possible 

1 Account is later taken of the bearing of errors of sampling on this con- 
clusion. See Chap. XVIII. 
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Table 110 

Illustrating the Computation of Theoretical Frequencies from a Table 

of Areas 


(4) (5) 

Number of 
cases between 


(1) 

(2) 

(3) 


Deviation 

Proportion of 

Class 

from 

area between y 0 

limit 

mean 

and ordinate 


X 

at " 


o 

<r 

0 

— 3.23 

.4993810 

50 

- 2.89 

.4980738 

100 

- 2.55 

.4946139 

150 

- 2.22 

.4867906 

200 

- 1.88 

.4699460 

250 

- 1.54 

.4382198 

300 

- 1.20 

.3849303 

350 

- .86 

.3051055 

400 

- .52 

. 1984682 

450 

- .18 

.0714237 

500 

+ .16 

.0635595 

550 

+ .495 

. 1896931 

600 

+ .83 

.2967306 

650 

+ 1.17 

.3789995 

700 

+ 1.51 

.4344783 

750 

+ 1.85 

.4678432 

800 

+ 2.19 

.4857379 

850 

+ 2.53 

.4942969 

900 

+ 2.87 

.4979476 

950 

+ 3.20 

.4993129 

1,000 

+ 3.54 

.4997999 

1,050 

+ 3.88 

.4999478 

1,100 

+ 4.22 

.4999878 


y 0 and ordi- 

Theoretical frequencies. 

nate 

by classes 

at - 

a 




496.88 




495.58 

0- 

50 

1.92* 

492.14 

50- 

100 

3.44 

484.36 

100- 

150 

7.78 

467.60 

150- 

200 

16.76 

436.03 

200- 

250 

31.57 

383.01 

250- 

300 

53.02 

303.58 

300- 

350 

79.43 

197.48 

350- 

400 

106.10 

71.07 

400- 

450 

126.41 

63.24 

450- 

500 

134.31 

188.74 

500- 

550 

125.50 

295.25 

550- 

600 

106.51 

377.10 

600- 

650 

81.85 

432.31 

650- 

700 

55.21 

465.50 

700- 

750 

33.19 

483.31 

750- 

800 

17.81 

491.83 

800- 

850 

8.52 

495.46 

850- 

900 

3.63 

496.82 

900- 

950 

1.36 

497.30 

950-1 

,000 

.48 

497.45 

1,000-1 

,050 

.15 

497.49 

greater 

than 



1,050 

.05 




995.00 


answers to such a question. The failure to fit may be due 
merely to chance fluctuations such as are found in any 
sample. We may have an underlying law of distribution 
of residence subscribers, classified by message use, which 

* The theoretical distribution shows . 62 of a case below — 3 . 23cr. To pre- 
serve formal consistency this amount has here been added to the theoretical 
frequency between 0 and 50. 
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accords perfectly with the normal law of error, but the 
particular sample selected may be marked by certain 
irregularities which would be ironed out if a very large 
number of cases were included. On the other hand, the 
differences may be due to the fundamental failure of such a 
distribution to accord with the normal law of error. Such 
a law may not describe the distribution of telephone calls, 
in which case the normal curve should not be employed. 

At this stage we may note, without discussion, that the 
differences between theoretical and observed frequencies in 



Fig. 86. — Illustrating the Fitting of a Normal Curve to Frequency 
Distribution of Telephone Subscribers, Classified according to Message 
Use 


the present example are small enough to be attributed to 
chance fluctuations of sampling. The reasoning that sup- 
ports this conclusion is presented in a later section (Chapter 
XVIII). The evidence is clear, however, that the discrep- 
ancies between the observed frequencies and those in the 
corresponding normal distribution are not excessively large. 
The observed facts are not inconsistent with the hypothesis 
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that residential telephone subscribers, classified according 
to frequency of telephone use, are distributed in accordance 
with the normal law of error. 

This conclusion gives generality to the results of our 
study. We have a great deal of information concerning 
the attributes of distributions following the normal law 
of error, and once £he identification of an actual distribu- 
tion with this standard type has been effected we may 
draw upon this store of knowledge. In using the original 
frequency table we are limited to the classes there estab- 
lished. We may now go beyond this and determine how 
many cases may be expected within stated limits. We may 
compute the probability of a case falling between any two 
points on the x-scale, or above or below any given value. 
The observed results, standing alone, are restricted in their 
significance to the particular observations recorded, but 
the theoretical frequencies have no such limitations. They 
apply generally, to the entire population from which the 
sample was drawn. In so far as we are assured of the repre- 
sentative character of our sample we have a basis for 
inference that would be afforded by no amount of study 
of the particular distribution as a thing apart. This fact, 
that a knowledge of the theoretical frequencies permits 
generalization beyond the limits of direct observation, is 
perhaps the most important of the advantages derived 
from the identification of an actual distribution with an 
ideal type, such as the normal distribution. 1 

NOTE ON THE DESCRIPTION OF THE FREQUENCY DISTRIBUTION 

With the aid of the criteria explained in this chapter it is possible 
to describe a frequency distribution more accurately than is possible 
with the measurements employed in the earlier chapters. A treat- 

1 As was stated, the normal curve is but one type of frequency curve, though 
one of basic importance. A comprehensive system of frequency curves is that 
associated with the name of Karl Pearson, who has derived equations to and 
has described in detail a number of standard types. An account of other 
fundamental types will be found in the books by Arne Fisher referred to at 
the end of this chapter. 
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ment of this subject is beyond the scope of the present book, but it 
seems advisable to indicate briefly the nature of these additional 
measures. 

The value of fa serves as a measure of the degree of “flat-topped- 
ness” found in a given curve. If fa = 3, as in the normal type, the 
curve is said to be mesokurtic . If fa < 3 the curve is platykurtic , or 
flatter than the normal type. If fa > 3, as in the example given 
above, the curve is leptokurtic , or more peaked than the normal. 

A measure of skewness which is more accurate than those given 
early in the book may also be computed from these criteria. 
Karl Pearson has shown that the quantity 

_ VftCfe + 3) 

X - 2(5ft - 6ft - 9) 

serves as a measure of the degree of asymmetry of a given curve. 
Inserting the values of fa and fa given above we have, in the case 
of the distribution based on message use, 

X = - .05558. 

(x is positive if the mean is greater than the median, negative if 
the mean is less than the median. In the present case the value of 
the mean is 476.96, that of the median is 482.39, hence the 
skewness is negative.) 

Finally, the distance, d , between the mean and the mode may 
be determined from the relation 

d = x X a. 

In the distribution described above (relating to telephone use) <r, 
in original units, equals 147 . 65. Hence 

d = - .05558 X 147.65 = - 8.21. 

Since 

Mo = M -d 

we have 

Mo = 476.96 + 8.21 = 485.17. 

This gives a truer approximation to the modal value than any of 
the methods discussed in Chapter IV. 

The methods exemplified in Table 109 and the accompanying 
text provide, therefore, a straightforward procedure for the meas- 
urement of the essential attributes of a frequency distribution. 
The mean and mode as measurements of central tendency, the 
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standard deviation as a measure of dispersion, x as a measure of 
skewness, and /3 2 — 3 as a measure of the degree of concentration 
of observations near the point of maximum frequency, may be 
computed directly from the first four moments of a distribution. 
These methods are available, of course, whether or not a study is 
to be carried to the point of determining and fitting a frequency 
curve of an appropriate ideal type. They are to be recommended 
for use in any systematic study of frequency distributions. 
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CHAPTER XIV 


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING 

The preceding pages have been devoted to an account of 
tools employed in statistical analysis. Examples illustrating 
the application of these tools to specific problems have 
been presented, but the emphasis throughout has been on 
technique. It is appropriate at this point that we stand off 
a distance, enlarging our perspective, and consider certain 
general problems relating to the application of these tools. 
What is their proper place in economic and business research? 
What are the assumptions involved in using them and 
what are their limitations? What are the end products 
of statistical analysis? How valid are the conclusions 
reached? What restrictions attach to such conclusions? 
We must give thought to such questions as these, if statistical 
methods are to be intelligently applied. 

Statistical Description and Statistical Induction 

In approaching this subject we must first make clear 
the distinction between statistical description and statistical 
induction. By employing the methods of statistics it is 
possible, as we have seen, to describe succinctly a mass 
of quantitative data. Hundreds or thousands of individual 
cases may be classified, and a frequency distribution formed. 
The essence of this distribution may be boiled down to 
perhaps four measures — of central tendency, variation, 
skewness, and kurtosis. A tremendous gain has been realized 
in thus replacing the multiplicity of individual cases by a 
limited number of measures that define the characteristics 
of the group as a whole. The possession of such tools makes 
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it possible for our limited powers of perception to grasp 
the significance of facts in the mass. Again, the methods 
of statistics enable us to describe relations between variable 
quantities. By securing the equation to an appropriate 
curve fitted to the data by mathematical methods, we 
may determine how much, on the average, one quantity 
changes in value as a related factor varies. This may be 
supplemented by a measure of the scatter or dispersion 
about the fitted curve, and by a measure, in abstract 
terms, of the degree of correlation between the dependent 
and the independent variables. 

In so far as the results are confined to the cases actually 
studied, these various statistical measurements are merely 
devices for describing certain features of a distribution, or 
certain relationships. Within these limits the measures 
may be used with perfect confidence, as accurate descrip- 
tions of the given characteristics. But when we seek to 
extend these results, to generalize the conclusions, to apply 
them to cases not included in the original study, a quite 
new set of problems is faced. 

The logical process by which one arrives at generalizations 
from a study of particular cases is termed induction, as 
opposed to deduction, which involves the drawing of special- 
ized conclusions from general propositions. By statistical 
induction or statistical inference is meant the generalization 
of statistical results, the application to a population of 
measurements derived from a sample. We are employing 
this procedure constantly in practical statistical work, 
though not always with a full realization of the assumptions 
inherent in that process and of the limitations attaching 
to it. 


The Natuke of Statistical Induction 

The problem at issue in considering the validity of 
statistical induction may be put in the following form: 
A statistical measurement — an average, a frequency ratio, 
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a coefficient of correlation — has been derived from the 
study of sample data drawn from a given population. 
(The term “ population” refers to a complete universe of 
things or phenomena having stated characteristics in com- 
mon.) May we assume that, if additional samples were 
taken from the same population, the corresponding measure- 
ments would have the same values? If not, may we deter- 
mine the approximate limits to the fluctuations to be 
expected in these measures, as derived from successive 
samples? Here, obviously, is a problem of supreme impor- 
tance. Karl Pearson has called it “the fundamental problem 
of practical statistics.” If we cannot be assured of a certain 
degree of stability in the results secured from successive 
samples it would be quite invalid to generalize from the 
examination of a limited number of cases. No weight 
would attach to any study except one covering the entire 
universe of things or phenomena composing the given 
population. Yet such all-inclusive studies of economic 
phenomena are practically impossible. Index numbers of 
prices, of wages, of living costs, equations describing the 
relation between the production and prices of given com- 
modities, coefficients of correlation between temperature 
and crop yield — all must of necessity be based on the 
study of samples. The problem of statistical inference, 
in the words of Oskar Anderson, is that of so utilizing the 
samples as to arrive at the best possible approximation 
to the characteristics of the universe. 

We have noted that statistical inference is a special 
form of a general process of reasoning, induction. Two 
points are to be emphasized concerning inductive reasoning. 
First, the conclusion of any induction holds only in terms 
of probabilities. For such a conclusion, by the very defi- 
nition of an induction, applies to cases not included in 
the observations. As opposed to deductive reasoning, in 
which the conclusion is implicit in the premises, induction 
yields a conclusion going beyond the premises. When all 
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the cases to be covered by the conclusions are included in 
the observations, the conclusion ceases to be an induction 
and becomes a descriptive statement. Accordingly, although 
induction is a highly fruitful means of adding to human 
knowledge, it is always hazardous. A leap in the dark is 
always involved, when we apply conclusions to cases not 
yet observed. 

The justification for this leap in the dark, and this is 
the second point we wish to stress, is found in an assumption 
that there is a “limitation to the amount of independent 
variety” found in nature. While there is variation in 
nature, the degree of such variation is limited; there is 
some uniformity in all natural processes. When we are 
dealing with quantitative data this uniformity in nature is 
found in the stability of large numbers, as exemplified by 
the curious regularities in such phenomena as birth rates 
or death rates. Nature, in other words, is not marked by 
utter chaos; principles of regularity, order and stability 
appear in all natural processes, and these principles are 
strongly evident when we deal with masses of quantitative 
data. Therefore, when we generalize such a measure as 
an index number of wholesale prices, we do so on some such 
assumption as this: It is reasonable to suppose that, in 
the larger population to which this result is to be applied, 
there exists a uniformity with respect to the characteristic 
or relation we have measured. As a result of this uniformity 
we should expect statistical measurements derived from 
successive samples drawn from this population to fluctuate 
within definite limits. 

It is evident that in making this assumption, in saying 
“It is reasonable to suppose . . . ,” we are introducing 
an hypothesis which is incapable of complete verification 
by purely statistical methods. There is, thus, in every 
statistical induction, an a priori element. The statistical 
conclusion can never stand completely on its own feet. 
It must be endorsed by reason and judgment if it is to 
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carry conviction. If a high positive coefficient of correlation 
were secured from the study of a sample relating to banana 
importations and sales of new life insurance, this would 
not fur nish convincing evidence of a causal relation, or a 
relation of contingency, between these two variables. There 
is no reasonable basis for assuming that, in the larger 
uni verse of phenomena from which the sample was drawn, 
there would be uniformity with respect to this relation- 
ship. 

Statistical inference differs from the general process of 
induction in that a quantitative result is generalized. We 
seek to apply to a larger group — the population — the 
value of mean, standard deviation, or coefficient of correla- 
tion that has been computed from a sample. The measure- 
ment secured from the sample is an estimate of the corre- 
sponding measurement relating to the population. The 
direct task faced in such generalization is that of determining 
the limits within which these estimates would probably 
fluctuate, if based upon a number of different samples 
drawn from the same population. A number defining these 
limits will serve as a measure of the reliability of the given 
results, when generalized to apply to the population. 

We should make clear at this point the sense in which 
the term “population” is used. When we speak of a popula- 
tion we are referring to an aggregate, whether of persons, 
things, or measurements, having certain common charac- 
teristics, or generated by a given system of causes. The 
term may refer to a hypothetical population from which 
a given sample may or may not have been drawn, or to a 
parent population of which a given sample is assumed to 
be representative. It may be a population of prices, or a 
population of cephalic indices; the term is not restricted 
to a population of persons. R. A. Fisher speaks of a “popula- 
tion of possibilities,” referring to the possible results of 
an experiment many times repeated. Of high importance 
in statistics are populations of statistical measurements — 
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means, coefficients of variation, standard deviations, etc. 
It is proper to note that the populations to which most 
statistical results apply are infinite in size. Statistical 
generalizations relate to hypothetical universes containing 
infinite numbers of units. We assume a sample to be 
drawn, not from the finite population that might be covered 
by actual enumeration, but from the infinite population, 
or universe, that would be generated if the forces or system 
of causes that brought this sample into being were to 
operate without limit. (Statisticians have given some atten- 
tion to special techniques, appropriate for dealing with a 
finite universe, but problems with which we do not here 
deal are faced in such applications.) 

The principle of the uniformity of nature is assumed, o* 
course, to apply to the universes from which our samples 
are drawn, if these samples are to be made bases of inductive 
generalizations. We must assume that these universes are 
stable, and that all their attributes are stable. An attribute 
of such a stable universe may not be exactly determined 
from the attribute of a single sample, but measurements 
defining the attributes of numerous samples drawn from 
the same universe will be distributed about the true value 
(i.e., that of the universe) in a systematic fashion. Each 
sample value is, of course, an estimate of the true value 
of the corresponding attribute of the population at large . 1 
The precise determination of the characteristics of this 
distribution of estimates is essential to the determination 
of their reliability. 

Having knowledge of this distribution we may deter- 
mine the limits within which estimates derived from dif- 
ferent samples of the same population may be expected 
to fluctuate. A measure of these limits will serve as a 

1 By convention, not yet generally adopted, but useful, the attribute of 
the population which is being estimated is termed a parameter , while the esti- 
mate of it is termed a statistic. Our certain knowledge is limited to statistics. 
We use this knowledge to the best of our ability to provide us with approx* 
imations to the true parameters which we can never know. 
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measure of the reliability of the given results, when gen- 
eralized. Such a measure might be secured by the labori- 
ous process of studying a great many different samples, 
just as the dice were thrown 4,096 times in a preceding 
example. Thus we might desire to test the reliability of an 
average of weekly earnings of a certain class of workers. 
A first average might be secured from a sample composed 
of 250 individual records. This result might be tested by 
computing 499 additional averages, each based on 250 
individual records. These 500 averages would not be 
identical in value, but if they were tabulated a frequency 
distribution closely approximating the normal type would 
be secured. From this distribution we might compute the 
mean of all the averages and the standard deviation of 
these averages. This standard deviation would serve as a 
measure of the variation found in the averages of weekly 
earnings, as computed from successive samples. 

We have noted at an earlier point that a Gaussian or 
normal distribution is generated when three general condi- 
tions prevail. These are: 

a multiplicity of forces affecting each observation 

independence of the various forces affecting each observa- 
tion 

equality of the forces tending to generate values above 
and below the mean value. 

The process of random sampling which would, presumably, 
be employed in securing the successive samples referred to 
in the preceding paragraph should satisfy the conditions 
giving rise to the normal distribution. There should be no 
special or imbalanced influences affecting particular samples. 
The differences between successive samples should be such 
as arise from a combination of forces, intermingled and 
not open to separate definition; that is, from “the mass of 
floating causes known as chance.” If these conditions be 
met, and if the field of observation (i.e, the universe being 
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sampled) be homogeneous, the distribution of means com- 
puted from the successive samples would be normal. 

This is a fact of high importance to statistical inference. 
In the realm of original observations, relating to persons, 
things, or events, normal distributions are the exception, 
rather than the rule. But the measurements which the 
statistician derives from successive samples, and which he 
employs in the inductive reasoning by which he generalizes 
his results, are far more frequently distributed in accordance 
with the Gaussian law. Much of the power of statistical 
instruments derives from this fact. 

The statistical investigator is rarely in a position to 
build up a frequency distribution of constants derived from 
numerous samples. It is generally impossible to take 400 
or 500 successive samples, in testing the reliability of a 
given measurement. As a practicable alternative a process 
of mathematical deduction is employed, in determining the 
characteristics of distributions of statistical measurements 
derived from random samples, drawn under stated conditions 
from given populations. An example of such mathematical 
deduction is provided by the derivation of the mean and 
standard deviation of a distribution generated under the 
following conditions: 

p, the probability of a given event occurring, is known 

q, the probability of the event not occurring, is known 

n, the number of independent events in a single trial, is 

known. 

Under these conditions, as was noted in the preceding 
chapter, M - np, and a = Vnpq - 1 By a somewhat similar 
chain of reasoning, we may determine the characteristics 
of a distribution composed of arithmetic means of a number 
of samples of constant size drawn from a given population. 
The standard deviation of such a distribution is given by 

aM= VN 


1 For proofs, see Appendix B 
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where (r M is the required measure, <j is the standard deviation 
of the population from which the samples have been drawn, 
and N is the number of observations in each of the samples. 1 

The determination by deduction of the characteristics 
of distributions of statistical constants derived from samples 
is fundamental to the whole process of statistical inference 
It is not, of course, a task that needs to be done afresh in 
each statistical investigation. When the law of distribution 
of a given class of statistical measurements has been deter- 
mined, statisticians may utilize the results in their various 
research fields, with due regard to all the conditions under 
which the given law holds This basic task has been per- 
formed for most of the statistical measurements currently 
employed. Earlier approximations have been refined in 
recent years for many classes of statistical measurements. 
The statistician today may draw upon a considerable body 
of tested and verified materials in determining the relia- 
bility of various kinds of statistical estimates. These 
materials exist in the form of shorthand expressions for 
the standard errors of different statistical constants, and 
in prepared tables for use when the distributions deviate 
materially from the type defined by the normal law of error. 

Practical Problems of Sampling 

The preceding discussion has dealt with one aspect of 
statistical induction. The argument has proceeded on the 
assumption that inferences concerning the attributes of 
a population would be based upon a sample thoroughly 
representative of the universe from which it was drawn. 
The securing of such a sample is a first condition of valid 
statistical induction. Practical problems of the first impor- 
tance are faced in the actual field work of sampling. The 
procedures employed in such field work lie, in the main, 
beyond the scope of the present book, but it is desirable 


1 For proof see Appendix C. 
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that the general nature of sampling techniques be indicated. 
References given at the end of the chapter deal in greater 
detail with these procedures. 

The task of securing an adequate sample calls, on the 
negative side, for an avoidance of bias in the individual 
observations and of preventable errors in schedules and 
tabulations. The term bias is applied to observational 
errors that are cumulative and non-compensating. Personal 
prejudices on the part of reporters, mental attitudes of 
which the subjects may be unconscious, or the mere physical 
conditions of observation may lead to persistent errors 
that distort samples. Errors in recording and tabulation 
are easier to detect. Training of enumerators and careful 
editing of schedules and tables will keep such errors to a 
minimum. 

On the positive side sampling technique is directed toward 
the securing of a sample that is truly representative of the 
universe of inquiry. This is a major task, calling for a 
high degree of care and judgment in planning field opera- 
tions conforming to the ultimate objectives of the study. 
A. L. Bowley has classified, under the four heads distin- 
guished below, methods suitable for use in securing a 
representative sample. 

The method of random selection is employed when the 
entire population to be sampled is treated as a whole, and 
members of the sample are so chosen as to be random 
members of that population. In this selection the indi- 
vidual choices must be independent of one another, and 
the chance of any member of the entire population being 
included in the sample must be the same as that of every 
other member. As regards the conditions of selection there 
should be present no element of preference or bias that 
would tend toward the inclusion or exclusion of certain 
members of the larger group. The general requirement 
here laid down should be interpreted, as J. M. Keynes 
has pointed out, to mean that with respect to the purpose 
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of the particular investigation the members of the sample 
should be random members of the population at large. 
Intelligent planning is needed in securing a purely random 
sample. The obvious procedure of picking the most readily 
available cases would by no means meet the condition of 
random selection. Certain important elements in the uni- 
verse of facts to which the conclusions are to be applied 
may be excluded through the play of an unconscious bias 
unless careful attention is given to the selection of cases. 

The population fro% which a given sample is to be selected 
is often not homogeneous, with reference to the pin-pose of 
particular investigation. OJ Slum districts and wealthy districts 
may both have to be covered, in a study of social or eco- 
nomic conditions. Agricultural districts differing mate- 
rially in fertility may be included in a farm survey. If, 
by a process of stratification, the universe of inquiry 
may be broken into sub-groups individually more homo- 
geneous than the total population, the reliability of sampling 
results may be substantially increased. Within each sub- 
group random selection may be employed. This method 
is termed stratified random selection. The size of each group 
in the sample should be proportionate to the relative 
importance in the total population of the stratum repre- 
sented by that group. Where homogeneous sub-groups are 
secured by the process of stratification, and where the 
differences between the sub-groups are pronounced, this 
method is distinctly superior to that of random selection 
among the undifferentiated members of the population at 
large. 

In using the third method, that of purposive selection, the 
statistician seeks to secure a sample having the same 
characteristics as the universe of inquiry in respect of one 
or more “control” factors. If these controls are highly 
correlated with the quantities that are the objects of investi- 
gation, this method of selection gives obvious support to 
generalizations based on the study of the sample. As in 
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stratified selection, sub-groups are employed. These sub- 
groups are chosen not at random, but in such a way as to 
possess, in the aggregate, the same attributes (e.g., means, 
standard deviations) as the population at large, in respect of 
the control factors. Deliberate manipulation, often through 
a process of trial and error, is necessary to effect this agree- 
ment between the sample and the totality. 

When this method is employed the statistician must, 
of course, have information concerning the “controls” for 
the total population. The application of the method is 
restricted to fields in which such knowledge is available. 
Census type inquiries on population, agriculture, and manu- 
factures provide such basic knowledge. Promising work 
has been done in purposive selection in dealing with agricul- 
tural data. 

The fourth method, that of stratified purposive selection, 
represents a combination of the use of stratification to 
secure homogeneous sub-groups and of deliberate selection 
through the use of controls. Where data are open to such 
stratification, and where necessary controls are available, 
the combined procedures may profitably be employed. 

When a representative sample has been seemed, when 
errors and bias have been avoided, we may still expect the 
attributes of the sample to differ from those of the total 
population. The effects of fluctuations of sampling will 
still be present, so long as the coverage of the sample falls 
short of the universe of inquiry. We may only estimate the 
attributes of the population; we still face the uncertainties 
that inhere in induction. It is possible, however, to define 
with considerable precision the probabilities involved in 
statistical induction when the differences between the 
attributes of the sample and those of the total population 
are due to fluctuations of “simple sampling,” that is, to 
the scrambled mass of causes that constitutes chance. 
Under these conditions it is possible to assign in advance 
limits within which we may expect statistical measures 
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derived from different samples of the same population to 
fluctuate. This means that we may apply to the population 
at large statistical measures secured from the study of a 
sample, not with confidence in their perfect stability, but 
with fairly definite knowledge of the margin of error involved 
in thus extending our results. Where the necessary condi- 
tions are fulfilled statistical induction is a valid procedure. 

Use of Measures of Reliability 

Measurements defining the sampling errors to which given 
statistical constants are subject are put to various uses. 
It is in order now briefly to review the standard errors of 
different statistical measurements, and to illustrate their 
applications. 


SAMPLING ERRORS: THE MEAN 

For the standard error of an arithmetic mean we have 

<7 

CT m = — — 

VN 

where the symbol a in the numerator of the right-hand term 
refers to the standard deviation of the population from 
which the sample is drawn and N is the number of observa- 
tions in the sample. Actually, of course, we do not know 
the standard deviation of the population, but we use as 
an approximation to it the standard deviation of the 
sample. The approximation is acceptable except when the 
number of observations in the sample is small, in which 
case special treatment is needed. 1 

Reference has been made above to the fact that a dis- 
tribution of arithmetic means computed from random 
samples of a given population usually follows the normal 
law of error. This is true even though the distribution 
of the population from which the samples are drawn is 
not itself normal. Accordingly, we may interpret given 

1 See Chapter XVIII. 
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values of <tm with reference to the probabilities associated 
with deviations in a normal distribution. 

Table 34 in Chapter V shows the distribution in 1933 
of 11,404 workers in open hearth steel furnaces, classified 
according to their average hourly earnings. The arithmetic 
mean of this distribution is 50.14 cents; the standard 
deviation, which we may here represent by s, is 18 . 685 cents. 
Accepting this standard deviation as an approximation to 
the standard deviation of the population from which this 
sample was drawn, 1 we have 


0M = 


s 

V 'N - 1 


18.68 5 

vTl'403 


.175. 


The true mean of the hourly earnings of wage workers 
in open hearth furnaces in 1933 is not known. The figure 
50 . 14 cents is our best approximation to it. If we should 
draw many samples, each the size of the one we have here, 
we should have many mean values normally distributed 
and centering, we may assume, at the true value. The 
standard deviation of this normal distribution we estimate 


1 The formula for the standard error of the mean, when the a of the popula- 
tion is known, is given by cm = — ?=• When the standard deviation of the 

y/N 

population is replaced by that of the sample (a), as an approximation to the 
desired quantity, the formula for cm may be written 

s s 


<*m = — — » or cm — — 

VN VN - 1 

The fir st of t hese is appropriate if s has been derived from the relation 
s = ^ i ( w ^ ere & is tlm deviation of a single observation from the 

mean); the second is appropriate if s has been derived from the relation 

/ 2d 2 

In other words, N should be reduced by 1 either in the derivation 

of a or in the derivation of c M - If <tm is derived from the d? s of the original data, 
the single operation is summed up in Bessel's formula 


_ _ ,/ 2d 2 

M V N(N-1) 

(See Whittaker and Robinson, Calculus of Observations , London, Blackie & 
Son, 1924, 205-206.) The reason for the reduction of N is discussed in Chap- 
ters XV and XVIII, in dealing with 11 degrees of freedom.” 
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as .175 cents. Knowledge of this standard deviation, or 
standard error, enables us to set limits within which it is 
highly likely that the true mean lies. Any statements we 
make about the true mean are to be interpreted with 
reference to this figure. 

We might, for example, on the basis of these results, 
make the flat statement: The true mean of the population 
lies between 49.965 cents and 50.315 cents. (The first of 
these limits is the sample mean minus one standard error; 
the second is the sample mean plus one standard error.) 
We may not assert that this statement is certainly true. 
It may be true or false. But if we continue indefinitely to 
draw samples from the population in question, computing 
the mean of each and the standard error of that mean, and 
if we make a statement about each similar to that made 
above, 68 out of 100 such statements will be true. (The 
actual numerical limits set by the different statements will 
differ, of course.) 

It is possible to vary the statement according to the 
degree of probability we wish to work with. Thus we might 
say: The true mean of the population lies between 49.80 
cents and 50.48 cents. 1 Of an indefinitely large number 
of such statements, each based on the study of a sample 
similar to the one before us, we know that 95 out of 100 
would be true. This is the kind of knowledge we have 
about generalizations based on results obtained from samples. 

The essential facts concerning the mean of the present 
sample and its reliability may be summarized in the state- 
ment: The mean hourly earnings of wage workers in open 
hearth furnaces in 1933 was (in cents) 50.14 ± . 175. 2 

1 49.80 = 50.14 - (1.96 X .175) 

50.48 = 50.14 + (1.96 X .175) 

Ninety-five per cent of the area under a normal curve is included within 
M ± 1 . 96<r. 

2 The measure of sampling reliability here given is the standard* error. The 
traditional usage, now less commonly followed, has been to give the probable 
error, which is .6745 times the standard error. In the present example the 
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The standard error of a mean is frequently used, not 
merely as an abstract measure of sampling reliability, but 
as an instrument for testing a given hypothesis. Such 
an hypothesis usually involves an assumed parent popula- 
tion, and the test centers about the question whether a 
given sample could have been drawn from this parent 
population. Let us assume that, on rational grounds, we 
have set up the hypothesis that the mean duration of 
business cycles is five years. We have observations relating 
to 77 cycles occurring in various countries during stages 
of rapid industrialization. 1 These cycles are distributed, in 
respect of duration, as follows: 


Duration of cycles, 
in years 

1 

2 

3 

4 

5 

6 

7 

8 
9 

10 


Number of 
cycles 

3 

10 

22 

15 

12 

8 

2 

2 

2 

_ 1 _ 

77 


The mean duration of these 77 cycles is 4.09 years, and 
the standard deviation of the distribution is 1.88 years. 
For the standard error of the mean we have 


1.88 


.216. 


Are these results consistent with the hypothesis tha t our 
sample of 77 cycles is drawn from a parent population 

probable error of the mean is .118 cents. It is well, in any case, to specify 
the exact measure of reliability being used. 

1 Cf. W. C. Mitchell, Business Cycles , The Problem and Its Setting, New 
York, National Bureau of Business Research, 1927, 412-416; F. C. Mills, 
“An Hypothesis Concerning the Duration of Business Cycles,” Journal oj 
the American Statistical Association, December, 1926, Vol. 21, 447457. 
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(i.e., a universe of cycles generated under similar conditions) 
with a mean duration of five years? 

If we use M to represent the mean of the sample data, 
Mh to represent the hypothetical mean of the universe, 
and T to denote the deviation of our sample mean from 
the hypothetical mean, expressed in units of the standard 
error of the mean, we may write 

„ M - M h 4.09 — 5.00 .91 _ 

<x M .216 .216 ‘ 

The figure . 216 is, according to our hypothesis, the standard 
deviation of a distribution of arithmetic averages the mean 
value of which is 5.00. If we were drawing from such a 
distribution, the mean of our present sample would represent 
a departure of 4.21 standard deviations from the general 
mean. What is the probability of such a departure occurring 
merely as a result of chance? Consulting a table showing 
areas under the normal curve, we find that the area on 
one side of the mean, lying at a distance of 4.21 standard 
deviations or more from the mean, constitutes 1/100,000 
of the total area under the curve. In terms of probabilities, 
this means that there is only one chance in 100,000 that 
a member of the population represented by the normal 
curve will fall below the mean value by 4.21 standard 
deviations or more. This chance is so remote that we say 
the event in question could not occur. With reference to 
the present problem, we conclude that the results are not 
consistent with the hypothesis. We could not have secured 
the sample values in question had we been drawing from 
a universe of cycles with a mean duration of five years. 
The results fail to confirm the theory we have set up. 

The probability cited (1/100,000) relates to a deviation 
on one side of the hypothetical value only. If we wish to 
define the probability of an observation departing from the 
hypothetical mean value 5.00 by 4.21 standard deviations 
or more, without reference to whether the departure be 
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above or below the hypothetical value, we must double 
the above probability. The chance of such a departure in 
one or the other direction is 2/100,000. Tests of hypotheses 
usually take this latter form. It is customary to ask whether 
a deviation of a stated magnitude could occur, and to 
measure the probabilities involved with reference to devia- 
tions in both directions. 

In using tables of the normal probability integral in 
tests of this type we are generally concerned with the 
probability of occurrence of deviations as great as or greater 
than some stated value (in the above example, .91 years, 
or 4.21 standard deviations). This probability is repre- 
sented by areas in the two tails of a normal curve (assuming 
that deviations either above or below the mean are in 
question). The inside limits of these segments are set 
by ordinates erected at distances from the mean equal 
to the deviation in question; the outside limits are at 
infinity. (See Fig. 85, in Chapter XIII, for a graphic 
representation of segments lying beyond stated limits.) The 
usual tables of the probability integral define the areas 
falling within limits set by ordinates at specific points. 
Our concern is with areas beyond these ordinates. Sub- 
traction of the internal area from the whole area (unity) 
will, of course, give the area of the external portion defining 
the probability that is here desired. 1 

If we should be testing the hypothesis that the mean 
duration of business cycles is four years, we derive the 
value of T as follows: 


T = 


4.09 - 4.00 
.216 


.42. 


From the tabulated values of areas under the normal curve 


1 See W. Edwards Deming and Raymond T. Birge, “On the Statistical 
Theory of Errors,” Reviews of Modern Physics , Vol. 6, July, 1934, 133ff., for a 
discussion of this probability, which they designate P M , and tests based on it. 
(In their terminology, u is the difference between the mean of the sample 
and the mean of the assumed population.) This article includes a chart (134) 
for use in determining the significance of a given deviation. 
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we determine that approximately 67 per cent of all the 
observations in a normal distribution will deviate from 
the mean value by .42 standard deviations or more. We 
interpret this to mean that if our sample of 77 observations 
were drawn from a universe with a mean value of 4.00 
years, the chances are 67 out of 100 that the mean of the 
sample would depart from the population mean by . 09 years 
or more. (We have counted the combined probabilities 
of deviations above and below the population mean.) In 
other words, a deviation as great as the one we have experi- 
enced is highly probable. The results are not inconsistent 
with the hypothesis that the mean duration of business 
cycles is 4.00 years. They do not, be it noted, prove the 
hypothesis. All that we may say of statistical evidence, 
on the positive side, is that it is not inconsistent with a 
given hypothesis. Supporting statistical evidence strengthens 
our confidence in the hypothesis, of course. Its tenability 
must be determined on the basis of rational considerations 
as well as empirical evidence. 

This last point deserves emphasis. “The significance of 
each test,” say Deming and Birge, 1 “depends not 'only on 
the value of P (i.e., the measure of probability appropriate 
to the test) that is found, but also on how much is known 
a priori regarding the parent population.” The above 
hypothesis of a four-year cycle has no particular rational 
basis (the figure was used here, of course, to exemplify 
a procedure). The fact that the observed results are not 
inconsistent with it is significant in a negative way, but 
does not establish the truth of the hypothesis. Low values 
of P, indicating that the facts are inconsistent with given 
hypotheses, are highly useful in leading us to reject tentative 
formulations of theory. Acceptable values of P, however, 
need the support of other knowledge (a priori and empirical) 
concerning the body of materials being studied and the 
regularities prevailing therein. Within the limit of acceptable 

1 Loc. tit., 137 - 
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values, indeed, we may accept one hypothesis, rather than 
another for which empirical tests yield a higher value of 
P, because the former is more consistent with the general 
body of existing knowledge concerning the field in ques- 
tion. 

In the two tests we have applied, no difficulty was encoun- 
tered in interpreting the probabilities bearing on the relation 
between the hypothetical mean and the observed facts. 
In the one case the odds were so small as to leave no doubt 
as to the lack of agreement; in the other case the difference 
was clearly insignificant. But many tests will lie on the 
borderline, and we must have some reasonable criterion 
as to the limit of significance. Odds of 1 out of 100 constitute 
one conventional standard. If a given difference between 
hypothetical and observed values would occur as a result 
of chance only 1 time out of 100, or less frequently, we may 
say that the difference is significant. This means that the 
results are not consistent with the hypothesis we have 
set up. If the discrepancy between theory and observation 
might occur more frequently than 1 time out of 100 solely 
because of the play of chance, we may say that the difference 
is not clearly significant. The results are not inconsistent 
with the hypothesis. The value of T (the difference between 
the hypothetical value and the observed mean, in units 
of the standard error of the mean) corresponding to a 
probability of 1/100 is 2.576. One hundredth part of the 
area under the normal curve lies at a distance from the 
mean, on the x-axis, of 2 . 576 standard deviations or more. 
Accordingly, tests of significance may be applied with direct 
reference to T, interpreted as a normal deviate (i.e., as a 
deviation from the mean of a normal distribution expressed 
in units of the standard deviation). A value for T of 2.576 
or more indicates a significant difference, while a value of 
less than 2 . 576 indicates that the results are not inconsistent 
with the hypothesis in question. 

There is, of course, nothing rigid about this particular 
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standard. Some statistical workers employ odds of 1 out 
of 20 as a limit, rather than 1 out of 100. With this standard 
we would accept as significant (i.e., not due to chance) a 
difference between hypothetical and observed values that 
would occur only 5 times out of 100, or less frequently, as 
a result of random fluctuations of sampling. The value 
of T corresponding to this standard is 1.96. The standards 
of significance actually employed by a research worker may 
well vary from problem to problem. The investigator uses 
the results of these tests of significance as aids in the inter- 
pretation of his results and in the development of a body of 
theory that is not inconsistent with the evidence provided 
by experience. In the interplay of deduction and induction 
that marks such a process, no single absolute standard 
for the rejection or acceptance of hypotheses would be 
appropriate. 

The formula for the standard error of a mean, as given 
above, relates to a sample chosen by random selection. 
For a proportionately stratified sample the standard error 
of the mean, <r m ,, may be derived from the relation 


where <r 0 is the standard error of the same mean as it would 
have been had the N observations been taken at random 
from the universe of inquiry, and a m is the standard deviation 
of the averages of the several strata about the average 
of the whole sample. 1 In computing <r m the deviation of 
the mean of each stratum is weighted in proportion to the 
number of cases in that stratum. N is the total number 
of observations in the sample. It is clear from the formula 
that the standard error of the mean of the stratified sample 
is smaller than the standard error of a corresponding random 
sample. 

1 The above formula is from A. L. Bowley, Elements of Statistics , London, 
King, sixth ed., 1937- 
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SAMPLING ERRORS: MEDIAN AND QUARTILE8 

The median is subject to greater sampling fluctuations 
than is the mean. The degree of dispersion of median 
values derived from a number of samples of a stated size 
from a given population will be approximately 25 per cent 
greater than the dispersion of the arithmetic means of 
the same samples. More exactly, we have 

<j Mi = 1 . 25331 ^===* 

Estimates of the quartiles, in turn, are less accurate than 
are estimates of the median. For these we have 

ff Qi — °"03 = 1 • 36263 -^= ==• 


SAMPLING ERRORS: STANDARD DEVIATION 


In determining the magnitude of the sampling errors to 
which the standard deviation is subject we must distinguish 
between samples drawn from a normally distributed universe 
and those derived in the more general case, in which the na- 
ture of the distribution of the universe is unknown. If the 
distribution of the universe is normal we have, as the esti- 
mated standard error of cr, 


<Tc = 


s 

V 2 N 


(where N— 1 has been used in the computation of s). Thus, 
for the universe of residential telephone subscribers repre- 
sented by the distribution in Table 109, we have 


147.7 

VL990 


3 . 31 . 


The more general formula for the standard error of the 
standard deviation involves the fourth as well as the second 
moment of the distribution: 


a. 



U 4 — u 2 2 

4u2 • N 
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For the distribution based on hourly earnings in open-hearth 
steel furnaces in 1933 the standard deviation was 18.685 
cents (see Table 34). As the standard error of this measure- 
ment we have 1 


<r« 



1,384.1183 - (13. 9674) 2 
4 X 13.9674 X 11,404 


= .0432. 


Since the moments here employed are in class-interval 
units, the derived measurement is also in those terms. In 
the original units we have 


<r a = . 0432 X 5 cents = . 2160 cents. 

Many tests of significance involve the use of standard 
deviations and corresponding measurements of sampling 
reliability. These are discussed more fully in the chapter 
on the analysis of variance. 


SAMPLING ERRORS: COEFFICIENT OF CORRELATION 

A number of distinctive problems are faced in generalizing 
the results of correlation studies and in determining the 
significance of the measurements secured in such studies. 
Certain of these problems are discussed in the succeeding 
chapter, and Chapter XVIII deals with important limita- 
tions that are faced when the samples employed are small. 
At this point general methods of measuring the reliability 
of correlation measurements are presented, without certain 
of the qualifications that will be discussed later. 

As a basic formula for the sampling error of the coefficient 
of correlation computed from N pairs of observations, we 
have 

1 — - r 2 

<Tr = vW^ = t 

where r, in the numerator of the right-hand member, is 
the true coefficient of correlation in the population at large. 
Since we do not know the true r we must use the r of the 

1 Since Sheppard’s corrections are not appropriate to this distribution, the 
uncorrected moments are used. 
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sample as an estimate of the required value. This formula 
may be taken to hold for distributions approaching the 
normal type, when the number of cases included in the 
sample is fairly large — say 50 or more. When the sample 
is small and, particularly, when we are dealing with a 
relatively high coefficient of correlation derived from a 
small sample, the standard error secured from the formula 
cited above may be faulty, and tests of significance based 
on it misleading. The reason for this and means of meeting 
the difficulty are discussed in Chapter XVIII. 

In exemplifying the application of the usual test, we 
may employ results presented in Chapter X, on the relation 
between the discount rates of Federal Reserve banks and 
of commercial banks. The value of r is + .84, while N 
equals 1,800. Accordingly, we have 


1 ~ (.84)° 
” Vl,800 - 1 


.294 4 

42.40 


.007. 


The standard error of r is frequently used, as are similar 
measurements relating to other statistical constants, to test 
hypotheses. We may put such a question as the following: 
Is the value of r secured from a given sample significant 
of a real relationship between the variables in question 
in the population from which the sample was drawn? 
Putting the question in form more appropriate for testing: 
Is the present value of r consistent with the hypothesis 
that there is no relationship between the variables in 
question in the population at large? R. A. Fisher terms 
such an hypothesis a “null hypothesis.” The purpose of 
experiment, in his words, is to give the facts a chance of 
disproving the null hypothesis. 

In a study of the movements of commodity prices, 1,202 
measurements were secured on the timing of advances in 
the prices of individual commodities during periods of 
general business revival. Paired with each measurement 
was a similar observation on the timing of the decline in 



476 INDUCTION AND SAMPLING 


the price of the given commodity during the succeeding 
period of general business recession. 1 We desire to know 
whether there is any relation between the sequence of price 
revival and the sequence of price recession. Is there a 
pattern in price movements during business cycles? Evidence 
of the existence of such a persistent pattern would lend 
support to the view that cycles represent true regularities 
in economic life. 

These 1,202 pairs of observations yield a correlation 
coefficient of + .27. This does now show a pronounced 
degree of relationship. Our chief concern, however, is not 
with the magnitude of r. We wish to know whether the 
result is consistent with the hypothesis that the true corre- 
lation is zero. For the standard error or r we have 


<j r = 


1 


Vl,2d2 - 1 


= .029. 


By hypothesis, the population value of r is zero, so the 
numerator of the fraction is 1. 

If the true value of r were zero, and the standard error 
of r were . 029, what would the probability be that, as a re- 
sult of chance, we should secure a coefficient of + . 27 from 
a given sample? Since this value represents a departure 
of more than 9 a ’ s from the hypothetical value of zero, 
the probability that the difference is due to chance is 
infinitely small. We conclude that the results are not 
consistent with the hypothesis that the sequence of price 
change during revival is unrelated to the sequence of decline 
in a succeeding recession. The null hypothesis is disproved. 

T ^in this case T = - - ^ 

than 2 . 576 the conclusion would of course have been differ- 
ent. In such a case the discrepancy between the sample r 
and the hypothetical value of zero could be attributed to 


Had the value of 


been less 


1 The Behamor of Prices , New York, National Bureau of Economic Research, 
1927 , 131 . 
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sampling fluctuations. The result would not be inconsistent 
with the null hypothesis. 

Having established that the results are not consistent 
with the hypothesis that the true value of r is zero, we may 
compute the standard error of r as actually derived. Assum- 
ing now that the sample is drawn from a parent population 
in which r = + .27, we have 


cv 


1 (~ 27 ) 2 - = .027. 

a/1,202 - 1 


SAMPLING ERRORS.’ INDEX OF CORRELATION 

The standard error of the index of correlation may be 
approximated from the relation 

„ 1 - P 2 . 

P V'N^m 


In this formula m represents the number of constants in 
the equation of regression. In the example cited in Chapter 
XII, relating to alfalfa yield and depth of irrigation water, 
p is .80, A is 44, and m has a value of 3. We have, thus 


- (.80) ! 
V44 - 3 


The use and interpretation of this measure are analogous 
to those of <r T . In the present instance the index of correlation 
is clearly significant. 1 


SAMPLING ERRORS : THE TEST FOR LINEARITY 

As a test for linearity we have been given 

f = r\ 2 — r 2 . 

But we wish to know whether, in a given case, the difference 

1 See Ezekiel, M., Methods of Correlation Analysis, N. Y., John Wiley and 
Sons, 1930, 257-258, for a discussion of the sampling reliability of the index of 
correlation. 
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between 17 2 and r 2 may be due merely to a chance fluctuation 
of sampling, or to a real departure of the underlying rela- 
tionship from the linear form. As the standard error of f 
Blakeman has proposed 

* t = 24 /I V (1 - vr - (1 - r*y + 1 . 

The use of this measure may be illustrated with reference 
to the problem relating to wheat yield which was considered 
in an earlier chapter. For the relation between wheat 
yield and amount of nitrogen used as fertilizer, we had 

r = + .793 
77 = .965 
N = 193. 

(The uncorrected value of 77 should be used here.) 

Therefore 

f = 77 2 _ T-i = .302. 

Inserting the given values in the formula for and solving, 
we have 

<r f = .074. 

With f having a value of . 302, about 4 . 08 times its standard 
error, there can be no question as to the non-linearity of 
the relationship. The difference between tj 2 and r 2 is one 
which could hardly be due to chance fluctuations of sam- 
pling. 

The criterion 77 2 — r 2 is not very satisfactory as a test 
of linearity, since the distribution of f does not follow the 
normal law. The same weakness attaches to the correla- 
tion ratio. As Fisher has demonstrated, the distribution of 
77 does not tend to normality, even with large samples, 
unless the number of arrays is increased without limit. 
Accordingly, the standard error of 77 is of dubious utility. 
More efficient methods of testing for the existence of 
correlation, and for linearity, are discussed in Chapter XV. 
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SAMPLING ERRORS : COEFFICIENT OF RANK CORRELATION 


The standard error of the coefficient of rank correlation 
has been given by “Student” as 

ffpr = vf^T 

This formula holds on the assumption that the population 
value of the rank correlation is zero . 1 Under this condition 
the standard error may be taken to relate to a normal dis- 
tribution, and interpreted in the familiar manner, when N 
is fairly large, say 45-50 or more. For small samples the 
distribution of p r is not normal. In the example cited in 
Chapter X, dealing with the relation between the number 
of individual income tax returns and the number of passenger 
automobiles registered in 1934, by states, we had p r = . 94. 
Since there are 47 observations, the value of ov r is given by 


1 

V46 


.147. 


The sample is large enough to justify the assumption that 
the distribution of p, would approximate the normal type. 
The coefficient of rank correlation is clearly significant, 
being more than six times its standard error. 


SAMPLING ERRORS: COEFFICIENT OF REGRESSION 

High importance frequently attaches to the coefficient 
of regression, in dealing with relationships among variable 
quantities. For the standard error of this measurement we 
have 2 


_ h 

° b ~ VS? 

where x is a given value of the independent variable, 

1 See Hotelling and Pabst, loc. cit. 

2 See R. A. Fisher, Statistical Methods for Research Workers , Edinburgh, 
Oliver and Boyd, sixth edition, 1936, 134-146. 
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expressed as a deviation from the mean of that variable, 
and s„ is the root mean square of the deviations of the 
actual values of y, the dependent variable, from the corre- 
sponding computed values. That is, s u is a measure of the 
scatter about the line of regression. 1 

A test involving the use of <r b may be applied to data 
relating to the average corn yield per acre in Kansas, by 
years, from 1890 to 1933 (see Table 128, Chapter XVI). 
These yields show a fairly consistent declining trend. A 
line of trend fitted to the figures for these 44 years is defined 
by the equation 

Y = 22.05 - . 1074X 


where Y denotes corn yield per acre and X denotes time, 
in years, with origin at 1889. We wish to know whether 
the coefficient of regression (i.e., the slope of the line of 
trend) represents a significant departure from zero. The 
hypothesis we are testing is, then, that the true value of 
the coefficient of regression, in the population from which 
this sample is drawn, is zero — that there has been no 
significant decline in corn yield in Kansas over the. period 
in question. 2 

For Sy we secure the value 6.70, for VZx 2 the value 
84.23. Accordingly 

6 .70 n _._ 

= 803 = ° 795 - 

We may denote by the symbol /3 the coefficient of regression 


1 



2< y-yc)* 
N -2 


where y denotes a given value of the dependent variable and y c denotes the 
corresponding value derived from the equation of regression. In the computa- 
tion of Sy for this purpose N must be reduced by the number of constants in 
the equation of regression. 

2 The hypothetical population of which we assume our sample to be repre- 
sentative is the population that would be generated by the forces responsible 
for variation in Kansas com yields from 1890 to 1933, if those forces, un- 
changed, were to act upon an infinite number of cases. The application of 
this concept, and of the whole probability calculus, to data ordered in time 
involves some logical difficulties, which are discussed at a later point. 
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assumed in our hypothesis (in this case zero). We wish to 
know whether the deviation of our actual b from this 
hypothetical /3 may be attributed to chance, or whether 
it is too great to be so explained. This deviation should 
be expressed in units of the standard error of b, in order 
that the probabilities underlying the normal distribution 
may be applied in our reasoning. Using T, as before, to 
denote the deviation in units of a, we have 

T - LjzJ = H 1074-0 
<r b ' .0795 

= - 1 . 35 . 

The given value of b represents a departure of 1.35 
standard deviations from the mean value of zero in our 
hypothetical population. As may readily be determined by 
reference to the table of the probability integral, such a 
deviation might easily occur, as a result of chance alone. 
The results then, are not inconsistent with the hypothesis. 
There is no clear evidence here of a significant decline in 
corn yield per acre in Kansas during the period covered. 

SAMPLING ERRORS: DIFFERENCE BETWEEN MEANS 

A problem of sampling that arises rather frequently is 
that of determining whether two samples could have been 
drawn from the same parent population. Obviously, there 
would be some difference between the means of two samples 
from the same universe, as there would be between standard 
deviations or coefficients of correlation secured from different 
sampling operations. We may illustrate the procedure 
employed in determining the significance of a difference 
between two arithmetic means. 

Reference has been made above to a sample of 77 business 
cycles, occurring during stages of rapid industrialization. 
Their mean duration was 4.09 years; the standard deviation 
of the distribution was 1 . 88 years. The same investigation 
indicated that the mean duration of 51 business cycles 
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occurring in various economies during early stages of 
industrialization was 5.86 years, and that the standard 
deviation of these measurements was 2.41 years. There is 
an indication here that business cycles are accelerated, 
that their average length is shortened, when an economy 
is passing through a phase of rapid industrialization with 
corresponding impetus to technological change. In this 
case the null hypothesis against which we set our facts 
is that there is no difference, in respect of duration, between 
business cycles occurring in the two stages of industrialization 
named. 

The difference between two means is a statistical meas- 
urement subject to a definite law of distribution. If a 
great many pairs of samples were drawn from a given 
population, the value D (i.e., M x — M 2 ) could be computed 
from the two means of each pair. A frequency distribution 
of the D’s thus secured would follow the normal law. The 
magnitude of the standard deviation of this distribution 
would be a function of the sizes of the samples thus paired 
and of the standard deviations of these samples. We may 
approximate the standard deviation of this distribution of 
D’s from the relation 



or from 

The measurement needed for testing the hypothesis now 
before us is computed from the relation 

= . /W , (2. 41) 2 
y 76 + SO 

= V. 1627 

= .4034. 

The value of D , the difference between the two means, is 
5.86 — 4.09, or 1.77. This value of D is to be judged 
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with reference to a hypothetical value of zero. Accordingly, 
for T (the discrepancy expressed as a normal deviate) we 
have 


1 . 77-0 
4034 


This discrepancy far exceeds the magnitude 2.576, corre- 
sponding to odds of 1 out of 100. If the true value of D 
were zero, a discrepancy as great as this or greater would 
occur as a result of chance about 1 time out of 100,000 
trials. The results indicate that the difference between 
the two means is not due to chance. The facts are not 
consistent with the hypothesis that the two samples are 
drawn from the same population. There is a significant 
difference between the average durations of business cycles 
occurring in early stages of industrialization and in later 
stages of rapid industrial change. 


SAMPLING ERRORS : DIFFERENCE BETWEEN PERCENTAGES 

There are occasions when it is desirable to determine 
whether a difference between two proportions (or percent- 
ages) is significant. Using D p to denote such a difference, 
we have 



where p 0 is the weighted mean proportion, q 0 is 1 — p a , and 
N i and IV 2 are the total numbers of cases in the two samples 
to which the proportions relate. 1 (In computing this value 
and applying the corresponding test it is necessary to divide 
percentages by 100, to reduce them to the form of propor- 
tions or ratios.) 

A tabulation of American and foreign business cycles 
by Wesley C. Mitchell has indicated a relative preponder- 
ance of three-year cycles in American experience. Of 32 

1 See Hornell Hart, '‘The Reliability of a Percentage,” Journal of the 
American Statistical Association , Vol. 21, March, 1926. 
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American cycles 10, or 31.2 per cent, lasted 3 years; of 
134 cycles in other countries 20, or 14.9 per cent, lasted 
3 years. 1 Is the difference between these two percentages 
great enough to justify the inference that the forces acting 
upon American business differ from those acting abroad, 
creating a significantly higher percentage of three-year 
cycles? The hypothesis that we test in this case is that the 
difference is not significant, that the groups of American 
and foreign business cycles are drawn from the same 
universe. 

The two proportions, pi and p 2 , with which we work are 
.312 and .149. The difference D p between the two pro- 
portions is .312 — .149 or .163. For the weighted mean 
proportion we have 

_ Nipi + Nifh 

Po Ni + N 2 

(32 X .312) + (134 X .149) , OA , 

32 + 134 1804 ' 

q a = 1 — p 0 = .8196. 

We compute the standard error of D p from the relation- 
ship shown above 

1804 X. 8196 (i + jij) 

= .005724 
<r Dp = .0757. 

Between the given value of D p and the hypothetical value 
of zero we have the discrepancy (expressed as a normal 
deviate) 

rp . .163-0 
.0757 

= 2.15. 

A discrepancy as great as this or greater might occur, 
as a result of chance, about 3 times out of 100. If our stand- 

1 See Business Cycles , the Problem and Its Setting , N. Y., National Bureau of 
Economic Research, 1927, 399-400. 
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ard of significance is 1 out of 100 we must conclude that 
the difference between the two percentages is not clearly 
significant. The result is not inconsistent with the hypothesis 
we set out to test — that American and foreign business 
cycles are drawn from the same universe, in respect of 
the proportion of three-year cycles occurring. It is proper 
to say, however, that we are dealing with border line results. 
If our standard of significance were 1 out of 20 we should 
consider the difference between American and foreign 
experience significant. Perhaps we should say that although 
the present evidence does not provide conclusive proof that 
the two samples come from different universes, there is 
indication of a difference between the forces affecting the 
relative frequency of three-year cycles in the United States 
and in foreign countries. Such results call for further 
research, in order that a more definite conclusion may be 
reached. 

SAMPLING ERRORS AND SIGNIFICANT FIGURES 

In deciding upon the number of figures to be recorded 
as significant, measures of sampling errors are, of course, 
pertinent. A useful general rule laid down by Truman L. 
Kelley follows: In a final published constant, retain no figures 
beyond the position of the first significant figure in one third 
of the standard error; keep two more places in all computa- 
tions. 1 Its application may be illustrated with reference to 
the figures on hourly earnings of 11,404 steel workers in 
1933. The mean, to four places, is 50.1360 cents. The 
standard error of the mean is . 175 cents. One third of this 
is .0583. The first significant figure is in the column of 
hundredths. By the rule, therefore, the arithmetic mean 
should be given as 50.14 cents. Two more places, or four 
decimal places in all, should be retained in calculations. 

1 The rule here given is the Kelley suggestion as re-phrased by P. J. Rulon 
0 Science , N. S. Vol. 84, No. 2,187, Nov. 27, 1936, 484). I have changed “one 
half the probable error” in Rulon’s statement to “one third of the standard 
error.” 
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Some Limitations to Measures of Sampling Errors 

The importance of such measures of reliability as have 
been discussed above is, of course, great. With their aid 
we may give precision to our judgments concerning the 
margins of error involved in extending statistical results 
beyond the limits of actual observation. Yet limitations 
attach to them, and these must not be forgotten in a purely 
mechanical application of statistical tests. 

Reference has been made to limitations relating to the 
size of samples. In the interpretation of most measures of 
sampling errors the assumption is made that statistical 
measurements secured from successive samples are dis- 
tributed in accordance with the normal law of error. When 
the number of cases is large this is approximately true, even 
though the original data are not so distributed. But with 
a small number of cases in each sample this assumption 
may be quite invalid. The significance of given deviations 
(in terms of T) is therefore materially altered when we are 
dealing with results secured from small samples. Techniques 
have been developed, however, for defining sampling errors 
based on small samples. These are discussed at a later 
point (Chapter XVIII) . 

Moreover, the conventional standard errors we have 
discussed can be assumed to measure only errors arising 
from the fluctuations of simple sampling. If there is to be 
full conformity to the conditions of simple sampling, the 
probability of a given event occurring must be the same 
in all parts of the universe being sampled and for all time 
periods included, and the individual events (i.e., drawings 
or observations) must be completely independent of one 
another. The fact that customary error formulas are 
strictly applicable only when these conditions have been met 
injects elements of doubt into many statistical inductions 
in the field of economics. We cannot always be sure that 
the conditions of simple sampling are actually fulfilled. 
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They are rarely perfectly fulfilled in the handling of economic 
data. The standard errors derived above can give no 
indication of the possibility of fluctuations in successive 
samples due to causes other than those arising from simple 
sampling. Fluctuations due to bias, due to lack of repre- 
sentativeness in the sample, due to persistent errors of 
any sort, quite elude this method of determining probable 
stability. Although some degree of departure from the 
rigid conditions of perfect sampling does not deprive the 
measures of reliability of all value, the limitations noted 
must be the constant concern of the statistician. 

The element of time adds one serious difficulty to the 
problem of statistical induction in the realm of economics, 
and in the social sciences generally. A universe that extends 
over time is subject to elements of change that are not 
present among data relating to a cross-section of time. 
Conditions of pig iron production, of banking, of foreign 
trade, of income distribution change from year to year, 
even from month to month. We may hardly assume that 
data relating to different time periods reflect the play of 
identical forces. When we deal with data from different 
periods we are, as Oskar Anderson has pointed out, drawing 
from different universes. The structural changes that occur 
in economic organization are manifestations of this state 
of never-ending transition. Accordingly the homogeneity 
of all populations extending over time is suspect. In par- 
ticular are hazards faced when an induction extends to a 
time period not covered by the data of observation. 

The fitting of trend lines, and the use of deviations from 
trend in statistical analysis, represent one effort to overcome 
difficulties arising 'out of temporal change. It is assumed 
that variations due to trend reflect the deep-seated changes 
that would introduce elements of heterogeneity into the 
particular universe of inquiry, and that deviations from 
trend may be made the bases of statistical inference. The 
effects of some temporal changes are doubtless removed by 
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this process. But the argument cannot justify the extension 
to a new time period of measures of sampling error based 
on the study of another period, unless it can be established 
that no essential change occurred in the conditions affecting 
the phenomena in question. The probable errors involved 
in such extension, without the validation noted, are not 
capable of definition. For this extension would involve 
generalizing about one universe from the study of an- 
other. 

In the application of statistical methods proper choice 
of objectives, wise planning, and effective field work are 
of at least equal importance with skill in the use of statistical 
techniques. This is especially true as regards problems 
of sampling. Here chief emphasis falls on soundness and 
accuracy in the field work. The problems of field work are 
specialized and particular, arising out of specific problems 
and conditions. Appropriate special knowledge is needed 
for the selection and validation of the sample. 

Much may be done to strengthen a statistical induction 
by making actual statistical tests of the homogeneity of 
the population and of the stability of sampling results. 
By the study of successive samples the representativeness 
of statistical measures may be determined; and by testing 
the subordinate elements of a given sample, when broken 
up into significant sub-groups, the inherent stability of a 
sample may be checked. The uniformity of nature in a 
given field is assumed in every induction. The induction is 
strengthened by every piece of evidence that supports the 
assumption. 
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CHAPTER XV 


THE ANALYSIS OF VARIANCE 

The determination of degree of correlation between vari 
ables involves, essentially, the comparison of measurements 
of variability. Thus, in the familiar equation 



we are comparing the dispersion about the fitted line of 
regression ( S y 2 ) with the dispersion about the mean of the 
y ' s (<r„ 2 ). Again, if we work with the relation 



we are comparing the dispersion of the computed values 
of y about the mean of the y’s (<r c „ 2 ) with the dispersion 
of the original observations about the mean of the y’s (<r» 2 ). 
It is logical thus to compare measurements of variation, 
in applying correlation technique, for the purpose of the 
investigator is usually to test an hypothesis concerning the 
forces responsible for variation in the dependent variable. 
He is usually seeking an associated factor which may, on 
some rational basis, be assumed to influence the fluctuations 
of the variable he is treating as dependent. R. A. Fisher 
has developed a procedure to employ in the study of correla- 
tion which is based explicitly upon the analysis of variance. 
We deal in this chapter with certain applications of the 
flexible and powerful instrument Fisher has forged. 

Comparison of Measures of Variability 

We deal first with a simple comparison of two groups, 
in respect of variability. The prices of preferred and com- 

400 
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mon stocks, as quoted on the New York exchanges, may 
be compared, to determine whether they differ significantly 
in variability Table 29, presented on a preceding page, 
showed the distribution of closing prices on July 25, 1936, 
of 66 preferred stocks, paying annual dividends of seven 
per cent. With this we may compare the distribution of a 
like number of common stocks selected at random from 
those for which prices were quoted on the New York Stock 
Exchange on July 25, 1936. The required values are given 
in Table 111. 


Table 111 


Comparison of Preferred and Common Stocks in Respect of Price 

Variation 



Degrees 

of 

freedom 

(n) 

Sum of 
squares of 

Mean 

square 

Standard 

Common 

logarithm 

of 

standard 

deviation 

Natural 

logarithm 

of 

standard 

deviation 


deviations 

from 

deviation 

(variance) 

deviation 

O’ 


mean 

a 2 


log w <r 

log, a 

Common 

stocks 

Preferred 

65 

99,327.28 

1,528.112 

39.09 

1.59207 

3.66590 

stocks 







(seven 
per cent) 

65 

30,812.20 

474.034 

21.77 

1.33786 

3.08056 


Difference = 0.58534 

Each distribution includes 66 observations. (It is not 
essential to this comparison that the number of observations 
in the two distributions be equal.) In computing the mean 
square deviation we divide the sum of the squared deviations 
from the mean by n, the number of degrees of freedom, 
which is here nqual to one less than the total number ol 
observations in each distribution, that is, to N— 1. (More 
is said below about the determination of number of degrees 
of freedom.) The standard deviation of the common stocks, 
39.09, is materially greater than the corresponding figure. 
21.77 for preferred stocks, but we cannot tell by inspection 
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whether the difference is significant, or whether it merely 
reflects a fluctuation of sampling. A precise test may be 
made by using the coefficient 2 as a measure of the difference 
in variability. 

This coefficient is equal to the difference between the 
natural logarithms of the two standard deviations. That 
is 

2 = log, <T x - log, <Tf 

It is to be noted that natural logarithms are to be employed. 
Common logarithms on the base 10 may be shifted readily 
to natural logarithms on the base e (2.71828) by using the 
factor 2.3026 as a multiplier. From the entries in the 
last column of Table 111 we derive .58534 as the value 
of 2 . 

If common and preferred stocks were alike, with respect 
to the dispersion of their prices, and if we had sufficiently 
large samples so that sampling fluctuations did not affect 
the measures of variance, the value of 2 would be zero. 
Is the value we have derived consistent with the hypothesis 
that the true value of z is zero? Could sampling fluctuations 
alone account for a deviation as great as .58534 from a 
true value of zero? If the derived value of z is too great 
to be attributed to sampling fluctuations, the hypothesis 
that common and preferred stocks are alike, with respect 
to the dispersion of their prices, is untenable. 

To determine whether the derived value of 2 is consistent 
with the hypothesis that its true value is zero, we must 
know something about the distribution of values of z, if 
these were computed from many samples drawn under 
the same conditions. Fisher has shown that this distribution 
is normal, or effectively so, when the two distributions 
being compared both include a large number of observations. 
This is also true when the two distributions include only 
a moderate number of observations, but with % and n 2 
equal or nearly equal. The standard deviation of a dis- 
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tribution of z’s secured under these conditions, or the 
standard error of z, is a function of the two n’s. It may be 
derived from the relationship 

a ’ = + 

where n x and are the number of degrees of freedom in 
the two distributions. 

In the present example n x and n 2 are both equal to 65; 
the standard error of z is equal to the square root of the 
reciprocal of 65. We have 

<r z = V70I538 = . 124. 

The test of the hypothesis that the true value of z is zero 
reduces, then, to the question whether a value of .58534 
is likely to be drawn from a normally distributed population 
with a mean value of zero and a standard deviation of 
.124. A value of .58534 represents a deviation of 4.72 
standard deviations from zero (i.e., z/<r z = 4.72). A devia- 
tion as great as this occurs so seldom, in random sampling, 
that we may not accept the conclusion that the present 
value represents a chance deviation from zero. The result 
is not consistent with the hypothesis that the true value 
of z is zero. The dispersion of common stock prices is 
significantly greater than the dispersion of the prices of 
preferred stocks paying seven per cent dividends. 

To exemplify a different condition, we may compare the 
dispersion of prices of preferred stocks paying six per cent 
and of preferred stocks paying seven per cent dividends. 
We have 64 quotations on the former, 66 quotations on 
the latter, both relating to closing prices on the New York 
Stock and Curb Exchanges on July 25, 1936. The figures 
are given in Table 112 on page 494. 

In this comparison the value of z is . 02890. The standard 
error of z (the square root of half the sum of the two recipro- 
cals) is .12502. The coefficient z deviates from zero by 
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Table 112 


Comparison of Six Per Cent and Seven Per Cent Preferred Stocks 
in Respect of Price Variation 



Degrees Sum of 
of squares of 
freedom deviations 
( n ) from mean 

Mean 

square 

Standard 

Natural 

logarithm 



deviation 
( variance ) 
a 2 

deviation of standard 
a deviation 

logs a 

1/n 

Seven per cent 







preferred 

stocks 

65 

30,812.2 

474.034 

21.77 

3.08056 

.0153846 

Six per cent 







preferred 

stocks 

63 

28,176.0 

447.222 

21.15 

3.05166 

.0158730 




Difference — 

0.02890 Sum - 

.0312576 


an amount equal to about one fourth of the standard error 
of z (z/<r t = .23). This, of course, is a deviation that would 
occur very frequently in a normally distributed variate 
with mean value of zero. The result is, therefore, consistent 
with the hypothesis that the true value of z is zero. There 
is no significant difference between six per cent and seven 
per cent preferred stocks in respect of the dispersion of 
their quoted prices. 

The Testing of Variability between Classes 

The comparison of standard deviations provides a means 
of answering questions of another type. Measurements of 
changes in the average selling prices of products of manu- 
facturing industries may be used to exemplify the procedure. 
If we classify manufacturing industries into those producing 
perishable, semi-durable, and durable goods, and compute 
an average of changes occurring between 1929 and 1933 
in the selling prices of the products of each of these categories, 
we obtain the index numbers given in Table 113. 

The average decline in prices was much less among durable 
manufactured goods than among goods of the other classes; 
semi-durable goods suffered the greatest loss. The range 
of variation among the three averages is considerable, but 
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Table 113 


Measurements of Average Changes in Selling Prices, 1929-1933, in 
Three Groups of Manufacturing Industries 


Class of industry 

Producing perishable goods 
Producing semi-durable goods 
Producing durable goods 
All industries 


No. of 

Index of selling pr 

industries 

1929 

1933 

34 

100 

69.81 

26 

100 

66.41 

25 

100 

78.96 

85 

100 

71.46 


on the basis of the evidence here given we are not able to 
say whether the observed differences are due to chance, 
merely, or whether the prices of these several classes of 
goods were subject to the play of quite different forces, 
during the period here covered. An objective test is needed, 
before we may assume that the observed differences are 
significant. 

For the application of such a test we need a measure of 
variation which is independent of the principle of classifica- 
tion here employed. How much might a series of price 
relatives for 1933, on the 1929 base, be expected to vary 
as a result of the play of chance? (By “chance” we here 
mean the mass of causes unrelated to the factor of relative 
durability .) A measure of the strength of such causes is 
provided by the variation within the three classes we have 
set up. The method used in measuring the variation within 
these classes is indicated in Table 114 on page 496. 

It will be understood that the deviations which, in 
squared form, enter into the sums in the last column are 
the differences between individual items and the means 
of the classes in which those items fall. Thus the relative 
measuring the average selling price of products of the meat 
packing industry in 1933 was 44.90 on the 1929 base. 
This industry falls in the perishable goods group. The 
difference between 44.90 and 69.81 is 24.91. The square 
of this, or 620.5081, is one of the 34 items making up the 
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Table 114 

Illustrating the Measurement of Variation within Classes 


(1) 

(2) 

(3) 

Mean of price 

(4) 

Sum of squares of 

Class of industry 

No. of 

relatives 

deviations of individual 

industries 

(1933 on 1929 
base) 

price relatives from 
class mean 

Producing perishable 

goods 

34 

69.81 

6,464.0275 

Producing semi- 

durable goods 

26 

66.41 

3,375.1849 

Producing durable 

goods 

25 

78.96 

5,725.6916 

All industries 

85 


15,564.9040 


figure 6,464.0275, the entry for perishable goods in the 
last column of the preceding table. The sum of the entries 
in this last column, 15,564.9040, represents variation in 
price changes within the three classes. It is not influenced 
by factors of perishability or durability, since the total is 
affected only by variation among perishable goods, variation 
among semi-durable goods, and variation among durable 
goods. 

Eighty-five items enter into this total. However, only 
82 degrees of freedom are present. The 34 perishable 
goods possess 33 degrees of freedom to vary, the 26 semi- 
durable goods possess 25 degrees of freedom, and the 25 
durable goods possess 24 degrees of freedom. For the 
standard deviation defining variability within classes we 
have, therefore 

, - = 13.78. 

This figure provides us with a yardstick, a measure of the 
degree of variation that is independent of the principle 
of classification employed in distinguishing perishable, semi- 
durable, and durable goods. This measures the variation 
due to the mass of floating causes known as chance. 
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With this standard we may compare the differences 
between the three class averages presented in Table 113. 
The magnitude of these differences may be defined by 
a single measurement, a standard deviation. In its computa- 
tion the deviation of each class mean from the grand mean 
is measured, and the square of this deviation is multiplied 
by the number of items in the class in question. The proce- 
dure is illustrated in Table 115. 


Table 115 


Illustrating the Measurement of Variation between Classes 


( 1 ) 


Class of 
industry 


( 2 ) 


No. of 

industries 


(3) (4) 

Mean Deviation of 
of price class mean 
relatives from mean of 
(1929 all 

= 100) observations 


(5) 

Square of 
deviation of 
class mean f rom 
mean of all 
observations 

(4) 2 


( 6 ) 

Weighted 
squared 
deviation 
(2) X (5) 


Producing 

perishable 


goods 

Producing 

34 

69.81 

- 1.65 

2.7225 

92.5650 

semi- 






durable 

goods 

26 

66.41 

- 5.05 

25.5025 

663.0650 

Producing 






durable 

goods 

25 

78.96 

+ 7.50 

56.2500 

1,406.2500 


2,161.8800 


The sum of the entries in column (6), 2,161 . 8800, measures 
the total variation between classes. Although weights are 
used in getting this total, the differences relate to three 
separate averages, only, and but two degrees of freedom 
are represented in the total. As a measure of the degree 
of variation between the three broad categories we have 
set up, we have 

. - _ 32.88. 
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This we may take as a measure of the difference in degree 
of change in selling price from 1929 to 1933 that appears 
to be related to the relative durability of products. 

The next step involves the formal testing of this figure 
against the standard provided by the measures of variation 
within classes. This test is applied in Table 116. Certain 
necessary calculations are also indicated. 


Table 116 


Comparison of Measures of Variation 


Degrees 

Nature of of 
variability freedom 
n 


Sum of 
squared 
deviations 1 


Mean 

square 

deviation 

a 2 


Standard 

deviation 

a 


Logw a 


Log e c 


Between 

classes 2 2,161.8800 1,080.9400 32.88 1.51693 3.49288 

Within 

classes 82 15,564.9040 189.8159 13.78 1.13925 2.62324 

Total 17,726 . 7840 Difference = z = 0 . 86964 


The test reduces, it is clear, to a comparison of two 
measures of variability. One, the standard of comparison, 
is the measure of variance within classes, a measure com- 
pletely independent of the perishability or durability of 
the product. The other is a measure of variation between 
classes. Such variation might be due to the same general 
mass of causes responsible for variation within classes, or 
it might be due to special forces related to differences in 
the durability of the goods in question. If the former 
explanation is correct, the two measures of variation should 
be of the same order of magnitude, with due allowance 

l The figure 17,726.7840, which is the sum of the squared deviations of 
individual observations from their respective class means and of the squared 
deviations of the several class means from the mean of all the observations, is 
equal to the sum of the squared deviations of all the individual observations 
from the mean of all the observations. In the table the total has been broken 
into two components, representing variability between classes and variability 
within classes. 
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for sampling fluctuations. If the second is the correct ex- 
planation the two measures of variation may differ appreci- 
ably in magnitude. The test, therefore, reduces to the 
question: Is the variation between classes significantly dif- 
ferent from the variation within classes, account being 
taken of the degrees of freedom present in the two cases? 

This question could be answered with reference to the 
standard error of z, provided the distribution of z be normal, 
or approximately normal. This is the case when the n’s 
that measure the number of degrees of freedom are both 
large or when, though of moderate value, they are equal 
or nearly equal. This condition prevailed in the examples 
cited earlier. It is not met in the present instance, so we 
may not with accuracy employ the method of estimating 
and utilizing the standard error of z that was used in the 
earlier case. When the numbers of degrees of freedom are 
unequal and relatively small, as in this case, tests of signifi- 
cance may be most readily made with reference to a tabula- 
tion of values of z, prepared by R. A. Fisher. This tabulation 
gives, for various values of and n 2 , values of z that would 
be exceeded 5 times out of 100, as a result of chance, if 
the true value of z were zero; it also gives one per cent 
values of z, i.e., values of z that would be exceeded 1 time 
out of 100, under conditions of random sampling, if the 
true value of z were zero. These two sets of values are 
reproduced in Appendix Tables VI and VII of this book, 
through the courtesy of Dr. Fisher and Oliver and Boyd, of 
Edinburgh, his publishers. 1 

In the present example the value of z, defining the degree 
of difference between the two measures of variation in 

1 Uses of the function z are discussed in R. A. Fisher's book, Statistical Meth- 
ods for Research Workers , Edinburgh, Oliver and Boyd, sixth ed., 1936. 

A table similar to Fisher's z~ table, but relating to an alternative measure F, 
has been constructed by George W. Snedecor. F is derived directly from the 
variances (i.e., the values of a 2 ) that are being compared; it is the ratio of the 
larger of the two variances to the smaller. For a table of values of F and a 
discussion of its uses see George W. Snedecor, Statistical Methods , Ames, 
Iowa, Collegiate Press, 1937. 
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Table 116, is .8696. In entering the z-table for the purpose 
of testing this measurement, the number of degrees of 
freedom corresponding to the larger of the two measures 
of variation compared is taken as n\, n 2 is the number 
of degrees of freedom corresponding to the smaller of the 
two measures. This is a necessary procedure, with reference 
to the table as constructed. In the problem that now con- 
cerns us »i = 2, ns = 82. 

For n i=2 and n 2 = 60, the 1 per cent value of 2 is 
.8025; for ru = 2 and n 2 = °°, the 1 per cent value of z 
is .7636. Interpolating, we obtain .7920 as the 1 per cent 
value of 2 for n, = 2, n 2 = 82. 1 If the true value of z were 
zero, we should expect a value as great as . 7920, or greater, 
to occur as a result of chance only 1 time out of 100. The 
present value of z materially exceeds .7920; the probability 
of a value as great as this occurring as a result of chance, 
if the true value of z were zero, is less than 1 out of 100. 
The results of the test are not, therefore, consistent with 
the hypothesis that the true value of z is zero. The differ- 
ences between the three class averages shown in Table 113 
are too great to be attributed to chance. We may conclude 
that the price movements of perishable, semi-durable, and 
durable manufactured goods between 1929 and 1933 were 
significantly different. 

1 Interpolation in the z-table is based upon direct proportions among the 
reciprocals of the n’s. In the above case 

for rii = 2, ri 2 = 60: the 1 per cent value of z =» .8025 l/n 2 = 1/60 = .0167 
for u\ = 2, tt 2 = » : the 1 per cent value of z = .7636 l/n 2 = l/« == .0000 

A - .0389 A = .0167 

We must find the 1 per cent value of z corresponding to ni = 2, nj = 82. 
For l/ft 2 we have 

1/82 = .0122. 

The difference between 1/82 and 1/ », for which we must interpolate between 
the given values of z, is . 0122 — . 0000 = .0122. The required 1 per cent value 

of z - .7636 + ^“oi67 X ° 389 / " - 7920 * 

The process of interpolation on the rii scale, if required, would be similar. 
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Variance Analysis in the Measurement op 
Relationship 

The procedure employed in the comparison of measures 
of variability is applicable to the measurement of correlation. 
Indeed, using this technique it is possible to employ a 
systematic procedure that is of great value in revealing 
the character and degree of the relationships prevailing 
between variable quantities. This procedure is illustrated 
in the next section. 

The method employed in applying to a typical correlation 
problem the method of analysis based on comparison of 
variances may be illustrated with reference to the data of 
alfalfa yield previously studied. These are presented in 
Table 117. 


Table 117 

Summary of Results Secured in Experiments with Alfalfa 

(The measurements in the body of the table measure yields, in tons per acre, 

in 44 experiments) 

Inches of irrigation water applied 



0 

12 

18 

24 

30 

36 

48 

60 


2.35 

4.31 

5.69 

6.00 

7.53 

7.58 

8 05 

5.55 


2.75 

4.78 

6.46 

6.89 

7.97 

8 22 

8.45 

7.25 


2.89 

4.84 

7.02 

7.96 

8.32 

8 63 

8.63 

10 17 


3 85 

5.83 

8.02 

8.32 

9.43 

9.33 

8.83 

10.70 


5.52 

G.51 


8.38 

9 54 

9.38 

9 52 


Average 

5.94 

7.52 


9.96 

11.06 

12.48 

10 62 


yield 

3.88 

5.63 

6.80 

7.92 

8.98 

9.27 

9.02 

8.42 7.48 


The average yield of alfalfa, in these 44 experiments, 
was 7 . 48 tons per acre. But there was rather wide variation 
among the results. The sum of the squares of the deviations 
of the 44 observations from the mean is 228.33. This 
s um sets our problem. We should like to find reasons 
for this variation. 
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TESTING FOR THE EXISTENCE OF CORRELATION 

The observations are set up above in a form suited to 
the testing of one hypothesis concerning the factors affect- 
ing alfalfa yield. The data are arranged in eight arrays, 
classified according to the depth of irrigation water applied. 
This depth varied from 0 to 60 inches. Variations in yield 
appear to be associated with variations in amount of water 
applied. As a basis for our procedure we set up the hypothe- 
sis that there is no such association. To test this hypothesis, 
we may break the sum that measures the total variation 
of yields into two parts measuring, respectively, the variation 
within arrays and the variation between arrays. 

To determine the total variation within arrays, the devia- 
tion of each observation from the mean of the array in 
which it falls is measured. The sum of the squares of these 
deviations, for all the arrays, is the desired total. Thus, 
in the first array of Table 117, the mean is 3.88 tons. 
The deviation of the first observation, 2.35, from this 
figure, is — 1.53; its square is 2.3409. The deviation of 
the second observation, 2.75, is — 1.13; its square is 
1.2769. Determining in similar fashion the deviations of 
the four other observations in that array from the mean 
of the array, squaring these, and adding the six squared 
values, we have 11.5320 as the sum of the squares of the 
deviations in the first array. Performing similar calculations 
for the seven other arrays, and adding the eight sums thus 
secured, we have a figure of 76 . 39. This is the total varia- 
tion within arrays. For convenience we may refer to this 
as component A of the total variation. 

In determining the total variation between arrays, the 
deviations of the means of the various arrays from the 
mean of all the observations are measured and squared, 
and the weighted sum of these squares is secured. Weights 
are based upon the number of observations in the several 
arrays. Thus the mean of the first array, 3.88 deviates 
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from the mean of all the observations, 7.48, by 3.60; 
the square of this is 12.9600. Multiplying by six (the 
number of observations in the first class), we have 77.7600. 
Securing similar weighted figures for the seven other arrays, 
and adding, we have 151.94 as the total variation between 
arrays. This we may call component B. 

In breaking up the total variation into two components 1 
we have distinguished variations in yield that are definitely 
not related to differences in depth of irrigation water applied, 
from variations in yield that may or may not be related 
to irrigation differences. Within the first array, including 
six experiments on plots to which no irrigation water was 
applied, yields varied from 2 . 35 tons to 5 . 94 tons per acre. 
The total variation within this array (the sum of the squares 
of the deviations from the mean of the array) amounted 
to 11.5320. Since the irrigation factor was constant, this 
sum measures variation which is completely independent 
of changes in irrigation. This is true also of the figure 
76 . 39, measuring total variation within all the eight arrays 
set up in Table 117. Differences in soils and innumerable 
minor factors combined to create variation within these 
arrays. The figure 76.39 measures the play of that host 
of undefined forces to which we give the name chance. 
The one specific factor which does not affect this figure is 
irrigation. We have measured the variation in such a 
way that irrigational differences do not enter. 

Irrigational differences do enter definitely into the varia- 
tion between arrays. Indeed, it may be the dominant 
factor in this variation, which is measured by the figure 

1 The sum of the two components is, of course, equal to the total. 

Variation within arrays (Component A) 76.39 

Variation between arrays (Component B) 151.94 

Total variation 228 . 33. 

For a demonstration of this relationship see note, pp. 418-9. 

To ensure full consistency between components A and B and the total 
(and among the sub-divisions of B later defined), when these quantities are 
independently computed, it is necessary that all computations be carried to 
more decimal places than are customarily retained. 
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151.94. But of this we cannot be sure. For the means of 
the eight arrays differ among themselves not only because 
of differences in the amounts of irrigation water applied to 
the different plots. To yield differences due to the irrigation 
factor are added yield differences due to the innumerable 
other forces that influence alfalfa yield, the forces we lump 
together as chance. For chance factors affect the means 
of the various arrays, and so affect the variation between 
arrays, just as they affect the variation within arrays. As 
the experiment was designed, the influence of irrigational 
differences is present only in the variation between arrays, 
but the influence of “chance ” is present in both the variation 
within arrays and the variation between arrays. 

In this fact is found the key to our problem, and the 
instrument for testing our hypothesis. For, in so far as 
chance alone is operative, the variation between arrays 
would be expected to be of the same order of magnitude 
as the variation within arrays. The figures we have so 
far examined indicate that the variation between arrays 
is greater than the variation within arrays. But this may 
be a purely fortuitous result. The apparent increase of 
yield with increased irrigation may be entirely a chance 
phenomenon, similar to a run of heads in tossing a coin. 
This we must test. We must determine whether the forces 
responsible for variation between arrays are the same as 
the forces responsible for variation within arrays. 

The hypothesis we shall test, and which may of course be 
disproved, is that the forces responsible for variation between 
arrays are the same as the forces responsible for variation 
within arrays; in other words, that there is no association 
between depth of irrigation water applied and alfalfa yield. 
The nature of the test to be applied has been indicated 
in the preceding sections. We shall compare two measures 
of variation, to determine whether they are of the same 
order of magnitude. But before this test is applied, account 
must be taken of the number of degrees of freedom pre- 
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vailing in each case. This concept calls for brief explana- 
tion. 

If the data of alfalfa yield related to but one plot of 
land, in one year, there would be no variation. A single 
observation would coincide with the mean, and the standard 
deviation would be zero. With a second observation oppor- 
tunity for variation arises. But we may think of it as a 
single opportunity. With but two observations there is 
but one degree of freedom to vary. With three observations, 
two opportunities to vary are given; there are two degrees 
of freedom. In problems of this sort the number of degrees 
of freedom is equal to N — 1. Our present example includes 
44 observations; hence the total variation 228.33 represents 
the resultant of 43 degrees of freedom. 

How are these 43 degrees divided between the two 
components, A and B ? As regards variation within arrays, 
this may be readily determined by reference to Table 117. 
Variation within arrays, it will be recalled, was measured 
with reference to the means of the various arrays. In the 
first array, containing six observations, there exist five 
degrees of freedom to vary from the mean of that array. 
The same is true of the arrays relating to 12, 24, 30, 36, and 
48 inches of irrigation water. In each of the arrays relating 
to 18 and 60 inches of water there are but four observations, 
with three degrees of freedom. The total of these degrees 
of freedom is 36. Variation between arrays was determined 
by measuring the deviations of the means of eight arrays 
from the general mean of the distribution. Since eight 
different values are involved, there are seven degrees of 
freedom. (The fact that weights were employed in securing 
the total variation between arrays does not affect the deter- 
mination of degrees of freedom.) The 36 and the 7, combined, 
use up all the 43 degrees of freedom entering into the total 
variation. 

Knowing these degrees of freedom we may now reduce 
the measures of variation within arrays and of variation 
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between arrays to comparable terms, and determine the 
significance of the difference between them. This is done 
in Table 118. This table and others following differ somewhat 
from those employed in similar comparisons in the opening 
sections of this chapter. In the earlier tables variability 
was measured in units of the standard deviation, and the 
function z was derived from the relationship 

2 = loge <Ti — log, <r 2 . 

It is often more convenient to perform the necessary calcula- 
tions in terms of the variance, that is, of <r 2 , and to derive 
2 from the relationship 

log. (Ti 4 — log. <Tj 2 

z = 

2 

The procedures lead to the same result, of course, since 
half the difference between the logarithms of the squared 
standard deviations is equal to the difference between the 
logarithms of the standard deviations, but the use of squared 
measurements eliminates one step in the calculation. 


Table 118 

A Test of the Existence of Correlation 


Nature of 

No. of 
degrees of 

Sum of 

Mean 

square 

Natural 
logarithm 
of mean 
square 
logger 2 

variability 

freedom 

(n) 

squares 

(variance) 

<r 2 

Within arrays 
(Component A) 
Between arrays 

36 

76.39 

2.12 

0.7514 

(Component B) 

7 

151.94 

21.71 

3.0778 




Difference 

« 2.3264 




z 

• 1.1632 

When we divide the sums of the squares by the corre- 


sponding figures defining degrees of freedom, we have com- 
parable measures of variance. Now it appears that the 
variance between arrays (21.71) is distinctly greater than 
the variance within arrays (2. 12), in disproof of the hypothe- 
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sis that the same forces account for the two variances. 

But we have a precise test to employ in determining 

whether these two variances are of the same degree of 

magnitude, within sampling limits. This is the coefficient 

z, which is half the difference between the natural logarithms 

of the two variances. In the present case, z is equal to 

3.0778 - .7514 
o > or 1.1632. 


If the forces responsible for variation within arrays were 
the same as those responsible for variation between arrays 
(that is, if our hypothesis were true), the value of z would 
be zero, with a sample of infinite size. The value of z we 
have secured is not zero. This may be proof that our 
hypothesis is false, or it may merely be a result of sampling 
fluctuations. The value of z might be zero in a given infinite 
population, but a random sample would be expected to 
yield results deviating considerably from zero. We wish 
now to take account of sampling fluctuations, in determining 
whether the result we have secured is consistent with the 
hypothesis that the true value of z is zero. 

In determining the significance of the present results 
we enter Appendix Table VI with ni (the number of degrees 
of freedom corresponding to the larger variance) equal to 
7 and n 2 equal to 36. Interpolating in Table VI, we find that 
the 1 per cent value of z corresponding to the stated values 
of «i and n 2 is .5780. 1 A value as great as this or greater 


1 It is necessary to interpolate on both scales of the 2 -table. First, following 
the procedure indicated on a preceding page, we interpolate in respect of n*. 
We obtain 

for ni = 6, n% =* 36, the 1 per cent value of 2 = .6047; 1/6 =» .1667 
for ni =» 8, n 2 * 36, the 1 per cent value of z « .5580; 1/8 = .1250 

A - .0467 A - . 0417. 

We must now interpolate on the ni scale, since the degrees of freedom are 
ni = 7, n 2 = 36. For 1/ni we have 1/7 =. 1429. The difference between 1/7 
and 1/8, for which we must interpolate between the given values of 2 , is 
.1429 - .1250, or .0179. 

( . 0179 \ 

X .0467 J — .5780. 
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would occur only 1 time out of 100, as a result of sam* 
pling fluctuations, if the true value of z were zero. The 
actual value, 1.1632, far exceeds the 1 per cent value 
of z. The evidence strongly indicates that z deviates from 
zero not because of the play of chance, but because the 
forces responsible for variation between arrays are of a 
different order from those responsible for variations within 
arrays. We are justified in concluding that our results 
are not consistent with the assumption that the true value 
of z is zero. The hypothesis that the forces responsible 
for variation between arrays are of the same character 
as those responsible for variation within arrays is not 
tenable. The results indicate the presence of a real con- 
nection between alfalfa yield and depth of irrigation water 
applied. 

TESTING THE HYPOTHESIS OF A LINEAR RELATIONSHIP 

Since it appears that there is a relationship between 
these two variables, it is now in order to secure an acceptable 
function, defining the relationship in quantitative terms. 
We may do this by testing, in turn, various hypotheses 
concerning the form of this function, until we secure one 
with which the observations are not inconsistent. We shall 
start with the hypothesis that there is a linear relationship 
between alfalfa yield and depth of irrigation water applied. 1 

The first step in applying the present test is to fit a 
straight line to the means of the eight arrays shown in 
Table 117. Variation among these means (component B 
of the total variation) reflects the presence of correlation 

1 Each hypothesis tested should be rational, acceptable on logical grounds. 
If we are thinking of general relationships, prevailing over the entire range of 
possible observation, the assumption of a straight-line relationship between 
alfalfa yield and amount of irrigation water applied is not tenable. For it is 
not to be expected that increased irrigation will increase yield without limit. 
In the present case we test the hypothesis of a linear relationship in order that 
the demonstration of procedure may be systematic and complete, although that 
hypothesis is not a rational one, even within the range of the present observa- 
tions. 
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between alfalfa yield and irrigation water applied. If the 
correlation is perfectly linear, all these class means will 
fall on the straight line; all the variation between arrays 
will be accounted for by the hypothesis of a linear relation- 
ship. If the relationship is substantially, though not per- 
fectly, linear, the portion of component B not accounted 
for by linear regression will be insignificant. If the regression 
is not truly linear the residue of B not accounted for (i.e., 
the scatter of the means of the arrays about the straight 
line of regression) will be too great, and some other hypothe- 
sis concerning the character of the relationship between 
alfalfa yield and irrigation water applied must be employed. 

A straight line fitted by the method of least squares 
to the means of the eight arrays is shown in Fig. 82 on 
page 406. The equation to the line is Y = 5 . 038 + . 0886X, 
where Y is alfalfa yield in tons per acre and X is depth of 
irrigation water applied, in inches. [We should note that 
in the fitting proqess the mean of each array is weighted 
by the number of observations in that array. This means, 
merely, that six points are assumed to have coordinates 
of 0, 3.88 (equal to those of the mean of the first array), 
that four points are assumed to have coordinates of 18, 6.80 
(equal to those of the mean of the third array), etc.] In 
Table 119 on page 510 are given the values of the means of 
the various arrays, and the corresponding computed values, 
as derived from the straight line of regression. 

It is clear from the graph and the table that the fit of 
the straight line to the means of the arrays is not perfect. 
The inadequacy of the fit is measured by the sum of the 
squared deviations of the class means from the corresponding 
computed values (each squared deviation being weighted 
by the number of observations in the given class). This 
sum is equal to 44 . 79. 

This sum, to which we may refer as B 2 , is one component 
of B, the variation between arrays. It is that portion of 
the variation between arrays that is not accounted for 
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Table 119 


Alfalfa Yield and Depth of Irrigation Water 
(Class means and values based on linear relationship 
Y = 5.038 + .0886.X) 


(1) 

(2) 

(3) 

(4) 

(5) 

Difference 

(6) 

(7) 

Inches 

No. of 

nhwr- 

Mean 

Estimated 

between mean 



of 

yield 

yield , linear 

yield of class 



water 

vations 

of 

relationship 

and estimated 



(class) 

class 

(tons) 

yield 

(Y P - y c ) 





f 

Y P 

Vo 

d 

d* 

W' 

0 

6 

3.88 

5.04 

- 1.16 

1.3456 

8.0736 

12 

6 

5.63 

6.10 

- .47 

.2209 

1.3254 

18 

4 

6.80 

6.63 

+ -17 

.0289 

.1156 

24 

6 

7.92 

7.16 

+ .76 

.5776 

3.4656 

30 

6 

8.98 

7.70 

+ 1.28 

1.6384 

9.8304 

36 

6 

9.27 

8.23 

+ 1.04 

1.0816 

6.4896 

48 

6 

9.02 

9.29 

- .27 

.0729 

.4374 

60 

4 

8.42 

10.36 

- 1.94 

3.7636 

15.0544 


44.7920 


by the hypothesis of a linear relation between yield and 
irrigation water. The method of deriving the other compo- 
nent of B is shown in Table 120. 

The sum 107.15, to which we may refer as B h is that 
component of the variation between arrays which is 
accounted for by the hypothesis of linear regression. The 
items in col. (3) of Table 120 differ from 7.48, the 
mean of all the observations, for the reason suggested 
by the hypothesis. They differ, on our present assump- 
tion, because with increased applications of water yield 
increases in a manner defined precisely by the equation 
Y = 5.038 + . 0886X. The sum of these variations, 107.15, 
represents, on this assumption, the full effect on alfalfa 
yield of variations of irrigation applications. 

The total of the two sums to which we have referred as 
B\ and B 2 is equal to 151.94, the total variation between 
arrays. Working on the hypothesis that the variables 
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Table 120 

Computation of Variation in Alfalfa Yield Attributable to Irrigation 
Differences on the Hypothesis of Linear Regression 


( 1 ) 

( 2 ) 

( 3 ) 

( 4 ) 

( 5 ) 

Difference 

( 6 ) 

( 7 ) 

Inches No. of 

Estimated 

Mean yield, 

between mean 
yield and 
yield esti- 
mated on lin- 
ear hypothesis 



of 

obser- 

yield , linear 

all obser- 



water 

vations 

relationship 

{tons) 

vations 







<Ve ~ Y) 




f 

Vc 

Y 

d 


fd* 

0 

6 

5.04 

7.48 

- 2.44 

5.9536 

35.7210 

12 

6 

6.10 

7.48 

- 1.38 

1.9044 

11.4264 

18 

4 

6.63 

7.48 

- .85 

.7225 

2.8900 

24 

6 

7.16 

7.48 

- .32 

.1024 

.6144 

30 

6 

7.70 

7.48 

+ .22 

.0484 

.2904 

36 

6 

8.23 

7.48 

+ .75 

.5625 

3.3750 

48 

6 

9.29 

7.48 

+ 1.81 

3.2761 

19.6566 

60 

4 

10.36 

7 AS 

+ 2.88 

8.2944 

33.1776 
107. 1520 


with which we are dealing stand in a linear relationship, 
we have broken the component B of the total variation 
into two portions. One of these (B i) measures the variation 
between arrays that is accounted for by the linear hypothesis; 
the other (B 2 ) measures the variation between arrays that 
is not accounted for by that hypothesis. We should expect 
some departure from linearity in a sample such as ours, 
even though it were drawn from a universe marked by a 
perfect linear relationship. But there are limits to the 
deviations that might reflect fluctuations of sampling. The 
question we now face is whether B 2 is small enough to be 
accepted as the resultant of random factors, or whether 
it is so large as to represent a breakdown of our hypothesis. 

In our earlier discussion we noted that component A 
of the total variation measured the influence of a host of 
random forces affecting alfalfa yield, forces other than the 
irrigation factor. Component A, therefore, serves as an 
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index of the magnitude of random forces, and hence as a 
standard defining the probable limits of sampling fluctua- 
tions, in so far as these are present in component B. We 
may use component A, which relates to variation within 
arrays, as a yardstick in determining whether B* is attribut- 
able to fluctuations of sampling, or whether it is too large 
to be so explained. 

In comparing components A and B 2 account must be 
taken of the number of degrees of freedom present in each. 
This has already been established for A. The following 
tabular summary of the operations just performed may help 


to explain the relations involved for B 2 . 



Nature of variability 

No. of degrees 
of freedom 

Sum of 
squares 

Mean 

square 

Between arrays, due to linear regres- 
sion (Component B i) 

Deviations from straight line of re- 

1 

107.15 


gression (Component B 2 ) 

Total variation between arrays 

6_ 

44.79 

7.47 

(Component B) 

7 

151.94 



The seven degrees of freedom entering into component B 
are divided, one to component B 1 and six to component B 2 . 
That the points on a straight line vary from one another 
with one degree of freedom is clear from a consideration 
of a linear equation y — a + bx. That the values of y 
may differ is due to the presence of the coefficient b, which 
defines the slope. If b were zero, the equation would define 
a horizontal line, with values of y constant. It is the slope 
that constitutes the one degree of freedom among points 
defined by a linear equation. With respect to Bt, we are 
dealing with eight points, to which a straight line has 
been fitted. If there were but two points both of them 
would lie on the line; there would be no possibility of 
deviation. With three points, one degree of freedom to 
deviate is introduced ; with eight points there are six degrees 
of freedom. The degrees of freedom to deviate from any 
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fitted curve are obviously equal to the number of points 
to which the curve is fitted, less the number of constants 
in the equation to that curve. 

Dividing 44.79 by 6 we may secure, then, the value of 
the variance (the mean square) comparable to the variance 
of component A. A test of our hypothesis again reduces to 
a comparison of variances. This appears in Table 121. 


Table 121 

A Test of the Hypothesis of Linear Relationship 


Nature of variability 

Degrees of 
freedom 

Mean square 
(variance) 

Natural logarithm 
of mean square 


n 

a 2 

log, a 1 

Within arrays (Compo- 
nent A) 

Deviation from straight line 

36 

2.12 

.7514 

of regression (Compo- 
nent B 2 ) 

6 

7.47 

2.0109 



Difference = 1.2595 




.6298 


The variation within arrays reflects the play of random 
factors, independent of irrigation. The force of these factors 
is indicated by a variance of 2 . 12. If similar random factors, 
independent of irrigation, were responsible for the deviations 
of the means of the eight arrays from the straight line of 
regression, we should expect the variance that measures 
such deviations to be of the same order of magnitude. 
Actually it is much greater, 7.47. But we cannot say, 
from inspection, that the difference between the two vari- 
ances is not due to fluctuations of sampling. An accurate 
test is needed. We may compute the coefficient z, half 
the difference between the natural logarithms of the two 
variances, and apply such a test. 

From the values given we secure a value of z equal to 
.6298. In determining whether this value is significantly 
different from zero, use must be made again of Fisher’s 
tables. For the values of «i and n 2 are relatively small 
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and unequal, and the distribution of z under these conditions 
would not be sufficiently close to the normal type to justify 
the use of its standard deviation. Entering Appendix 
Table VI with n i equal to 6, n 2 to 36, we find that the 1 per 
cent value of z is . 6047. We take this to mean that, if the true 
value of z were zero, random sampling fluctuations would 
be expected to give a value of z as great as . 6047, or greater, 
only one time out of 100 trials. The actual value of z in 
the present instance is greater than .6047. Only rarely, 
less frequently than one time out of 100, would chance 
account for a value of 2 as great as the one observed. We 
conclude, therefore, that random forces, of the type respon- 
sible for variation within arrays, are not responsible for the 
deviations of the means of the eight arrays from the straight 
line of regression. These deviations are too great to be con- 
sistent with the hypothesis that there is a linear relationship 
between alfalfa yield and depth of irrigation water. This 
equation fails to account, adequately, for the observed 
variation between arrays. 

TESTING THE HYPOTHESIS OF A CURVILINEAR RELATIONSHIP 

We may now test the hypothesis that a power curve 
of the second degree (F = a + bX + cX 2 ) defines the rela- 
tion between alfalfa yield and depth of irrigation water 
applied. The procedure is identical with that followed 
in the case of the straight line. By the method of least 
squares we determine the best values of the constants in 
an equation of the desired form. The curve is fitted to 
the means of the eight arrays, each weighted by the number 
of observations in that array. The derived equation is 
F = 3.539 + .2527 X — . 002827X 2 . The curve appears 
graphically in Fig. 82, and the computation of the sum 
of the squared deviations from it is shown in Table 122. 

The inadequacy of the fit is measured this time by the 
figure 4.61, the sum of the squared deviations from the 
power curve of the second degree. This sum, to which we 
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Table 122 

Alfalfa Yield and Depth of Irrigation Water 
(Class means and values based on a power curve of the second degree) 


( 1 ) 

( 2 ) 

( 3 ) 

Mean 

yield 

of 

class 

(tons) 

( 4 ) 

( 5 ) 

Difference 

( 6 ) 

( 7 ) 

Inches 


Estimated 

between 



of 

No. of 

yield , from 

mean yield 



water 

obser - 

equation 

of class 



(class) 

vations 

(tons) 

and esti- 
mated yield 





Y „ 

Vc 

Y P -yc 




f 



d 

d 2 

fd' 

0 

6 

3.88 

3.54 

+ .34 

.1156 

.6936 

12 

6 

5.63 

6.16 

- .53 

.2809 

1.6854 

18 

4 

6.80 

7.17 

- .37 

.1369 

.5476 

24 

6 

7.92 

7.98 

- .06 

.0036 

.0216 

30 

6 

8.98 

8.58 

+ .40 

.1600 

.9600 

36 

6 

9.27 

8 97 

+ .30 

.0900 

.5400 

48 

6 

9.02 

9.16 

- .14 

.0196 

.1176 

60 

4 

8.42 

8.52 

- .10 

.0100 

.0400 


4.6058 


may refer as Bt, is a component of B, the variation between 
arrays. It is the portion that is not accounted for by the 
hypothesis of a curvilinear relationship, of the type assumed, 
between alfalfa yield and irrigation water applied. The 
other component of B is derived by the method indicated 
in Table 123 on page 516. 

We may designate by B z the sum 147.32. This is the 
component of the variation between arrays that is accounted 
for by the hypothesis of a relationship defined by a second 
degree curve. The items in col. (3) of Table 123 differ 
from the mean of all the observations, on our present 
assumption, because alfalfa yield varies with increased 
applications of water in a manner defined by the equation 

Y = 3.539 + . 2527 A - .002827 A 2 . 

We have again broken B, the total variation between 
arrays, into two components, B 3 representing the influence 
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Table 123 


Computation of Variation in Alfalfa Yield Attributable to Irrigation 
Differences on the Hypothesis of a Non-Linear Regression 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

(7) 

Inches 

No. of 

Estimated 
yield , 

equation of 
second degree 

Mean yield , 




of 

water 

obser- 

vations 

all obser- 
vations 






Vc 

Y 

Vc-Y 




f 



d 


f* 

0 

6 

3.54 

7.48 

-3.94 

15.5236 

93.1416 

12 

6 

6.16 

7.48 

- 1.32 

1.7424 

10.4544 

18 

4 

7.1/ 

7.48 

- .31 

.0961 

.3844 

24 

6 

7.98 

7.48 

+ .50 

.2500 

1.5000 

30 

6 

8.58 

7.48 

+ 1.10 

1.2100 

7.2600 

36 

6 

8.97 

7.48 

+ 1.49 

2.2201 

13.3206 

48 

6 

9.16 

7.48 

+ 1.68 

2.8224 

16.9344 

60 

4 

8.52 

7.48 

+ 1.04 

1.0816 

4.3264 

147.3218 


of the irrigation factor, working in accordance with a definite 
law, and B\ representing random factors, or random factors 
combined with the irrigation factor. (The irrigation factor 
enters into B t to the extent that the hypothesis in question 
fails to take account of the true relation between alfalfa 
yield and depth of water applied.) This is, of course, a 
different division of B from that resulting from the applica- 
tion of a linear hypothesis. The present division may be 
set down in summary. 


Nature of variability 

No. of degrees 
of freedom 

Sum of 
squares 

Mean 

square 

Between arrays, due to regression of 
second degree (Component £ 3 ) 
Deviations from second degree curve 

2 

147.32 


of regression (Component B 4 ) 
Total variation between arrays 

5 

4.61 

.92 

(Component B) 

7 

151.93 



The seven degrees of freedom entering into component B 
are now divided, five to component B 4 and two to compo- 
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nent B 3 . The reasons for this allocation of the degrees of 
freedom are similar to those presented in discussing the lin- 
ear hypothesis. As regards component B 3) the item now of 
chief concern to us, it is clear that when a curve defined 
by an equation with three constants is fitted to eight 
points there are five degrees of freedom to deviate from 
that curve. 

Dividing 4.61 by 5 we secure . 92, the value of the vari- 
ance comparable to the variance of component A. For 
again we must use a criterion based on A, in determining 
the limits within which variation due to random factors, 
independent of irrigation, may play. We come again to a 
comparison of variances. 


Table 124 


A Test of the Hypothesis of Curvilinear Relationship 


Nature of variability 


Within arrays (Compo- 
nent A) 

Deviation from second de- 
gree curve of regression 
(Component B t ) 


Degrees of Mean square Natural logarithm 


freedom 

n 

(variance) 

<T 2 

of mean square 

fog* a* 

36 

2.12 

.7514 

5 

.92 

- .0834 


Difference 

z 

= - .8348 
= - .4174 


In this case the degree of deviation from the curve of 
regression defined by the power curve of the second degree 
is actually less than the deviation within arrays, which 
serves as our yardstick. The value of z is therefore negative, 
equal to — .4174. This measure may be tested for signifi- 
cance by the methods previously discussed. The z-table 
is entered with n 3 = 36 (the number corresponding to the 
larger of the two variances), n 2 = 5. Interpolating in the 
table for these values we obtain 1.1158 as the 1 per cent 
value of z. The present value is distinctly less than this. 
The difference between the two measures of variance is 



518 


ANALYSIS OF VARIANCE, 


not significant. The departures from the curve of regression 
may be attributed to “chance,” that is, to random factors 
independent of the irrigation factor. 

In following this general procedure it is necessary to test 
different hypotheses (i.e., different functions) only until the 
difference between the variance defined by component A 
and the variance defining departures from the curve of 
regression be small enough to be attributed to the play 
of chance. Thus, if a P of .05 constitutes our standard, 
the difference between the two variances given in the pre- 
ceding table, as measured by z, might be positive and as 
great as .4536, without leading to rejection of the hypothesis 
being tested. It could be as great as . 6370 if our standard 
of significance were a P of .01. 1 A rather exceptionally 
close fit by the second degree curve we have employed gives 
us the negative value of z we have actually obtained. 

We have arrived, then, at an hypothesis concerning the 
relation between alfalfa yield and depth of irrigation water 
applied, with which observed facts are not inconsistent. 
Our observations, be it noted, do not establish the truth 
of this hypothesis. Other hypotheses might be equally ten- 
able, and perhaps even more closely in accord with the facts. 2 

1 These figures are derived from Tables VI and VII by the process of interpo- 
lation described above, with ni = 5 and n 2 = 36. (ni is taken as equal to 5, of 
course, only when B* is greater than A; n\ is always taken to represent the 
number of degrees of freedom corresponding to the larger of the two variances 
being compared.) This method of interpolation is applicable over the range of 
the 2 -table, except for the comer relating to values of ni in excess of 24 and 
values of n 2 in excess of 30. For dealing with cases in this region, R. A. Fisher 
gives the following formulas for approximating the desired quantities: 

5 per cent value of z = — ,7843^- — 

y/h- 1 \ni n t J 

i ± * * 2.3263 rtoi- / 1 1\ 

1 per cent value of z ■* —-===. — 1 . 235 ( ) 

Vh — 1 V*i n 2 / 

In these formulas h is the harmonic mean of ni and n%. That is 

— + — 

1 «l «2 

h = 2 

* We could, of course, fit a curve of still higher degree, the equation to which 
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All that we can say is that the observed facts do not dis- 
prove the hypothesis. If the hypothesis is tenable on rational 
grounds, we have reached a conclusion upon which we may 
rest, for the time. 

summary: variance analysis in the measure of 

RELATIONSHIP 

The procedure employed in the last example may be 
summarized and certain measurements presented which 
show the relation of this procedure to methods discussed in 
preceding chapters. The quantitative results are presented 
in Table 125. 


Table 125 


Component Elements of the Variability oj Alfalfa Yield , and Various 
Measures of Correlation 


Total variability of observations 
relating to alfalfa yields and 
components of this total 


Test of 
significance 


Total variability (sum of squared 
deviations from Mean) 228. 33 

1. Division of total variabil- 
ity into: 

A. Variation unrelated to 

irrigation factor (i.e., 
variation within ar- 
rays) 76.39 

B. Variation attributable 

to irrigation factor, z = 1 . 1632 

and to other causes, in 1 per cent 

indeterminate propor- value of z 

tions (i.e. variation be- = . 5780 

tween arrays) 151 . 94 


Measure of 
correlation 


Correlation ratio 
, 151.94 

V 228.33 
= .6654 

V = .82 


(Footnote £ continued from page 518.) 

contained four constants, or more, instead of the three constants in the equa- 
tion actually employed. The deviations from this curve of higher degree would 
be smaller than from the curve of second degree, and z would be correspondingly 
smaller. It is a principle of scientific procedure, however, to employ the sim- 
plest acceptable function. Needless complexities, whether in the form of un- 
necessary assumptions or of unnecessary constants in an equation of relation- 
ship, are rigorously avoided. 
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Table 125 — Continued 


Component Elements of the Variability of Alfalfa Yield } and Various 
Measures of Correlation 


Total variability of observations 
relating to alfalfa yield , and 
components of this total 

2. Division of component B 

of (1) above into: 

Bi . Variation attributable 
to irrigation factor on 
the assumption of a 
linear relationship 107 . 15 

B 2 . Variation attributable 
to irrigation factor, 
but not explainable in 
terms of a linear rela- 
tionship, and to other 
causes, in indetermi- 
nate proportions 44 . 79 

3. Division of component B 
of (1) above into: 

Bz. Variation attributable 
to irrigation factor on 
the assumption of a re- 
lationship defined by 
power curve of second 
degree 147.32 

jB*. Variation attributable 
to irrigation factor, 
but not explainable in 
terms of power curve 
of second degree, and 
to other causes, in in- 
determinate propor- 
tions 4 . 61 


Test of 
significance 


z = .6298 
1 per cent 
value of z 
= .6047 


2 = - .4174 
1 per cent 
value of z 
= 1.1158 


Measure of 
correlation 


Coefficient of 
correlation 
, __ 107.15 
r 228.33 
= .4693 
r = .69 


Index of corre- 
lation 
2 = 147.32 
P 228.33 
= .6452 
P = .80 


The meaning of this summary should be clear, with 
reference to the preceding demonstration. Component A 
of the total variability, being independent of the influence 
of the irrigation factor, is the yardstick, or standard of 
reference, which is used in all the tests of significance noted 
in the second column. Component B, in the first test. 
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is shown to be clearly greater than A, when account is 
taken of the number of degrees of freedom present in the 
two quantities. Thereafter, component B is broken into 
sub-components, first on the hypothesis that alfalfa yield 
and irrigation are related by a linear function, next on 
the hypothesis that the relationship is defined by a power 
curve of the second degree. The evidence is not consistent 
with the first of these hypotheses, and it is rejected. (The 
hypothesis would be rejected on rational grounds, as well 
as on the basis of empirical evidence.) The results are not 
inconsistent with the second hypothesis, and we may accept 
it, subject to the possibility of modification on the basis 
of later experience. 

Three abstract measures of degree of correlation between 
alfalfa yield and applications of irrigation water are given 
in the right-hand column. All of these may be derived 
directly from the quantities employed in the variance 
analysis. Study of the elements of these correlation meas- 
ures, and of the relation of the several measures to the 
corresponding hypothesis, will provide a suggestive review 
of the general problem of correlation. 

We should note here that an assumption of normality 
is implied in the comparison of standard deviations, or 
of variances, in this type of analysis. Minor departures 
from normality do not materially affect the procedure, but 
substantial departures do so. The conversion to other 
forms (such as logarithms or reciprocals) of observations 
not normally distributed in natural terms will sometimes 
yield normal distributions. Where this is possible, the 
precision of the method of variance analysis is increased 
by such conversion. Limitations arising out of material 
departures from normality may be avoided, also, by the 
use of ranks, as is done in the computation of the coefficient 
of rank correlation. Appropriate procedures have been 
developed by Milton Friedman . 1 

' “The Use of Ranks to Avoid the Assumption of Normality Implicit in the 
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Variance Analysis in Testing the Significance of 
Seasonal Fluctuations 

The methods outlined in this chapter are applicable to 
certain of the problems encountered in the analysis of time 
series. They are peculiarly appropriate in determining 
whether the seasonal fluctuations observed in a given series 
represent a true seasonal pattern. Apparent seasonal move- 
ments would be present in any series of observations covering 
a period of years, by months. Chance factors would create 
some differences between averages of all the Januaries, all 
the Februaries, etc., even though no true seasonal movement 
existed. We require an objective test, to be used in deter- 
mining whether the differences among such monthly aver- 
ages are significant or not. 

The entries in the body of Table 126 are the figures 
obtained when freight car loadings by months, for the 
period 1918-1927, are expressed as percentages of linear 
trend values. (The original data are given in Chapter VIII.) 
The arithmetic mean of the ten items for January appears 
in the bottom row, with similar means for the other months. 1 
The test for seasonality involves answering the question: 
Do these means differ significantly from 99.9867, the average 
value, of the 120 items in the table? In seeking to answer 
this question we must break the total variance of the freight 
car loadings data into its elements. We wish to define that 
portion of the total variance apparently due to seasonal 
movements. This may then be appraised with reference to a 
yardstick representing what we may call the residual vari- 
ability of the series. 

In computing the total variance we may make use of the 

Analysis of Variance,” Journal of the American Statistical Association, Vol. 32, 
Dec. 1937, 675-701. 

1 These means, it may be noted, are not precisely the same as the seasonal 
indices given in Chapter VIII. In seeking to improve the representativeness 
of the monthly indices, only the four central items for each month were used in 
the averaging process employed in that chapter. Here it is necessary to employ 
the arithmetic mean of all the items for each month. 
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familiar relation = ~~jy~ ~ c 2 where d is the deviation 

of an observation from the true mean, d' is the deviation 
from an assumed mean, and c is the difference between true 
and assumed means. In this case we take the assumed 
mean at 0 on the original scale, and c is thus equal to the 
mean. Since we wish to work with sums of squared values, 
we use the relationship 

2d 2 = 2(d') 2 - Nc 2 . 

(The mean should be computed to more places than are to 
be finally retained, since the process of squaring and multi- 
plying by N greatly magnifies even slight errors.) 

The entries in col. (16) of Table 126 are the sums of 
the squares of the items in the body of the table. Inserting 
the proper values in the above formula, we have 

2 d 2 = 1,213,250.14 - 120 X 99.9867 2 
= 1,213,250.14 - 1,199,680.82 
= 13,569.32. 

As in the alfalfa problem discussed above, this total 
may be broken into an element representing variance 
between the monthly means and variance within the several 
months. (Reference here is to the columns of Table 126.) 
The variance between months may be computed directly 
from the monthly means. 

Thus: 

Sum of squares of deviations of monthly means from grand mean 
- 10 X (99.9867 - 90. 60) 2 + 10 X (99.9867 - 92. 04) 2 
+ 10 X (99.9867 - 96. 38) 2 + . . . 

+ 10 X (99.9867 - 88.35 ) 2 . 

That is, the deviation of each monthly mean from the 
grand mean is squared and weighted by the number of 
items represented by that mean; the sum of the twelve 
figures thus obtained is the required measure of variability 
between months. 
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An alternative shorter method may be employed in 
determining the variance between months, utilizing the 
relationship 

2d 2 = 2 (d'y - Nc\ 

Here each d' is the mean value for a given month. Each 
squared value must be weighted by the number of items 
represented by the mean. Thus 

2 (fL'Y = 10(90. 60) 2 + 10(92. 04) 2 + 10(96. 38) 2 + . . . 

+ 10(88. 35) 2 

= 1,207,068.40. 


The correction factor, Nc 2 , is the same as in the first opera- 
tion. We have, then, 

Sum of squares of deviations of monthly means from 
grand mean = 1,207,068.40 — 1,199,680.82 

= 7,387.58. 

This sum measures that portion of the total variability 
that may be attributed to seasonal fluctuations. Is it 
significant or does it merely reflect the play of the mass 
of undifferentiated factors we call chance? 

In answering a similar question concerning the alfalfa 
problem we used as a yardstick the variability independent 
of the one factor the effects of which were being studied — 
namely, irrigation differences. In the present case we could 
obtain a measure that is independent of seasonality by 
computing the variability within the several columns of 
Table 126. That is, each item in col. (2) could be deducted 
from the January mean, 90 . 60, and the sum of the squared 
deviations in this column obtained; a similar sum could 
be obtained for each of the other columns numbered from 
(3) to (13). The grand sum of these figures would be the 
variability within arrays — variability clearly independent 
of seasonal forces since only differences among items for 
the same month enter into it. This sum, measuring varia- 
bility within columns, has a value of 6,181.74. The 
variability between columns plus the variability within 
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columns is, of course, equal to the total. That is, 
7,387.58 + 6,181.74 = 13,569.32. 

The measure of variability within columns will not serve 
in the present instance, however. The yardstick should be 
a measure of the variability due to “chance” — to the 
play of a mass of random factors which may not be observed 
and measured individually. Effects that can be clearly 
attributed to specific causal forces should not be included 
in the yardstick. But some of the variability within months 
may be clearly assigned to changes associated with the 
classification by years. The average of the 12 monthly 
items for 1918 is 108.09; that for the 12 months in 1921 
is 87.63. The former was a year of prosperity, the latter 
one of depression. Clearly, some of the differences among 
the items in the January column, or in the May column, 
are definitely attributable to cyclical forces that raise all 
the monthly figures for one year and depress all the monthly 
figures for another year. (The influence of trend is not 
present, since the items in the body of the table are actual 
values expressed as percentages of trend.) The variability 
within months should be corrected by the subtraction of 
that portion of it that may be attributed to factors affecting 
yearly conditions as a whole. 

The influence of cyclical and other forces affecting whole 
years is measured by differences between the averages for 
1918, 1919, 1920, and the other years covered. These 
averages are given in col. (15) of Table 126. The desired 
quantity may be obtained by the precise methods used in 
measuring the variability between months. We have 

2d 2 = 2(d') 2 - iVc 2 
or, 

Sum of squares of deviations of yearly averages from grand mean 
= (12 X 108. 092 2 + 12 X 98.542* + 12 X 103.267* + . . . 

+ 12 X 98.383*) - 120 X 99.9867* 
= 1,203,537.88 - 1.199,680.82 
- 3,857. 0«, 
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(There will, of course, be ten squared items within paren- 
theses, one for each of the ten years covered by the data.) 
Subtracting 3,857.06 from 6,181.74, the measure of total 
variability within the columns, we have 2,324.68 as the 
balance. This is the desired yardstick. It measures that 
portion of the variability among the original items which 
is clearly independent of the seasonal influence. Secondly, 
it has been corrected by the subtraction of that portion 
of the variability within months which is attributable to 
cyclical and other factors responsible for broad changes 
from year to year. The final balance represents the play 
of forces independent both of seasonal movements and of 
broad swings affecting each yearly value as a whole. This 
residual variability, measured by the figure 2,324 . 68, reflects 
the play of all those random, undifferentiated factors we 
lump together as chance. 1 

This residual variability may be most readily computed by 
subtracting from the total variability the two figures measur- 
ing, respectively, variability between the means of the months 
and variability between the means of the years. At this stage 
of the computation these figures will be in the form of sums 
of squared deviations. The form of organization employed in 
Table 126 on page 528 is convenient for these calculations. 

In the application of the test of significance, account 
must be taken of the number of degrees of freedom entering 
into each of these measures of variability. Table 127 indi- 
cates a suitable procedure. 

1 When, as in the present example, the influences of the two variables, or 
principles of classification, are independent, it is valid to use the residual vari- 
ability thus computed as a measure of the strength of random factors. If these 
influences are not independent (if, in terms of the above example, seasonal 
movements affecting the monthly averages and cyclical movements affecting 
the annual averages should be correlated), the residual quantity will not 
be an accurate measure of truly random factors. When the residual quantity 
which is used as the yardstick in variance analysis is derived from observa- 
tions that are alike in respect of both principles of classification (i.e., when 
the quantity measures variance within cells obtained by the application of a 
two-fold principle of classification) this difficulty does not arise. An example 
of this type is given in Appendix E. 
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Table 127 


Analysis of Variance of Freight Car Loadings and Test of Seasonality 


(1) 

(2) 

(3) 

(4) 

(5) 

Nature of 

No. of degrees 

Sum of 

Mean square 
(variance) 

<r 2 

(3) - (2) 

Natural logarithm 

variability 

of freedom 
(n) 

squares 

of mean square 
log e <r 2 

Between means 

of years 

9 

3,857.06 



Between means 

of months 

11 

7,387.58 

671.598 

6.50970 

Residual varia- 

bility 

99 

2,324.68 

23.482 

3.15627 

Total 

119 

13,569.32 

Difference = 3.35343 
3.35343 


= 1.67671 

The item 3,857.06 measures the degree of difference 
between 10 yearly averages. Nine degrees of freedom are 
represented in this figure. (The use of weights in computing 
the sum of the squares does not affect the number of degrees 
of freedom.) Similarly, 11 degrees of freedom are repre- 
sented in the measure of variability between the 12 monthly 
means. The total variability is computed from 120 items, 
so there are 119 degrees of freedom in all. The number 
of degrees of freedom in the residual variability is, therefore, 
119 - (11 + 9), or 99. 

The variance between the means of months (i.e., the 
mean square) is 671.598. The residual variance is 23.482. 
The tost of seasonality reduces to the question: May the 
variance between the means of months be attributed to 
the random forces responsible for the residual variance? 
Unless the variance between the monthly means is signifi- 
cantly greater than the residual variance, no significance 
may be attached to the observed differences between the 
averages for January, February, March, and the other 
months. 
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The test is applied with reference to the measure z, 
which is equal to half the difference between the natural 
logarithms of the two variances being compared. From 
the entries in Table 127 we compute z as equal to 1.67671. 
Referring to Appendix Table VI we find that for »i = 11 
and n 2 = 99, the 1 per cent value of z is approximately . 44. 
The present value is distinctly greater than this. The 
results are not consistent with the hypothesis that the 
true value of z is zero. There is clear evidence of the existence 
of a definite seasonal pattern in freight car loadings. 1 

The same yardstick may be applied in testing whether 
the differences between the yearly averages are significant. 
The rather wide variations from year to year in the average 
values of the items in the body of Table 126 represent, 
presumably, the play of cyclical forces plus major “acci- 
dental fluctuations” affecting yearly totals. (The trend 
factor, had it not been eliminated, would have combined 
with these other two to create differences among the yearly 
totals.) But are these year-to-year differences great enough 
to be attributed to definite forces other than the chance 
factors represented in the residual variance we are using 
as yardstick? 

The variance between means of years is equal to 
3,857 .06 -f-9, or 428 . 562. Is this significantly greater than 
23.482, the residual variance? Following the procedure 
illustrated in Table 127 we obtain 1.35352 as the value 
of z. The 1 per cent value of z, for Wi = 9, n 2 = 99, is 
approximately .47. The test indicates that the differences 
between the annual averages are due to definite forces 
other than the random factors represented in the residual 
variance. 

1 In the test here applied we are proceeding on the assumption that the 
seasonal pattern is constant from year to year. If it is not constant, the ac- 
curacy of the residual variability, as a measure of “chance” factors, and of 
the measure of variability between months will be affected, and the signifi- 
cance of the results will be lessened. If there is reason to believe that seasonal 
movements have changed over the period covered, tests of the kind suggested 
in Chapter VIII should precede the tests here discussed. 
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CHAPTER XVI 


THE MEASUREMENT OF RELATIONSHIP: 
MULTIPLE AND PARTIAL CORRELATION 

In dealing with methods of defining correlation in the 
preceding chapters we have been concerned with problems 
involving only two variables, a dependent variable and a 
'single independent variable. We have found, in certain 
cases, a fairly high degree of correlation between the two 
variables studied. But it is obvious that, in general, 
economic phenomena are affected by more than one factor, 
that the fluctuations in a single variable may be due to 
the interaction of many forces. In dealing with just two 
variables all other factors are ignored, on the assumption, 
usually, that in the single independent variable are found 
the most important causes 1 of fluctuations in the dependent 
variable. Thus, in the alfalfa example given, the effect 
upon yield of but a single factor, irrigation, was studied. 
Yet variations in rainfall and temperature must have 
affected the yield in the different years studied. Similarly, 
variations in practically every factor dealt with in economic 
analysis are traceable to more than one cause. If our 
analysis is to be complete we must employ methods which 
will enable more than two variables to be handled at a 
time. We need instruments that will assist us in measuring 
the combined effect upon a single variable of a number 
of factors. Such instruments may be secured by a simple 
extension of methods already familiar. 

In Table 128 are presented figures showing the yield 
of corn, per acre, in Kansas from 1890 to 1933, together 

1 This should not be taken to mean that the coefficient of correlation meas- 
ures or establishes causal relationships. 
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Table 128' 


Corn Yield and Temperature in Kansas , 1890-1933 1 


(1) 

(2) 

(3) 

(4) 

. (6) 

Average 

Average 

Average 

Average 


yield per 

June 

July 

August 

Year 

acre } in 
bushels 

tempera- 

tempera- 

temperar 


ture 

ture 

ture 


X i 

Xi 

Xz 

X 4 

1890 

15.6 

77.6 

83.1 

76.1 

1891 

26.7 

70.7 

74.0 

75.1 

1892 

24.5 

73.4 

77. 5 

76.5 

1893 

21.3 

74.7 

79.5 

73.8 

1894 

11.2 

74.2 

77.8 

78.0 

1895 

24.3 

71.7 

74.9 

76.0 

1896 

28.0 

74.1 

78.1 

78.7 

1897 

18.0 

76.6 

80.2 

76.0 

1898 

16.0 

75.0 

77.7 

78.2 

1899 

27.0 

73.9 

76.2 

80.6 

1900 

19.0 

74.9 

77.9 

81.0 

1901 

7.8 

77.3 

85.0 

79.1 

1902 

29.9 

70.9 

76.8 

78.2 

1903 

25.6 

67.2 

78.3 

75.3 

1904 

20.9 

70.4 

75.6 

74.6 

1905 

27.7 

75.5 

74.5 

78.7 

1906 

28.9 

71.8 

73.8 

76.3 

1907 

22.1 

72.0 

78.4 

78.1 

1908 

22.0 

72.1 

75.8 

76.2 

1909 

19.9 

73.1 

78.1 

80.1 

1910 

19.0 

72.2 

79.5 

75.7 

1911 

14.5 

80.5 

78.6 

76.4 

1912 

23.0 

69.3 

79.9 

77.4 

1913 

3.2 

74.2 

82.1 

84.2 

1914 

18 5 

78.2 

79.9 

78.2 

1915 

31.0 

69.2 

74.0 

70.1 

1916 

10.0 

70.3 

81.2 

79.6 

1917 

13.0 

72.8 

80.8 

73.4 

1918 

7.1 

78.4 

78.3 

82.3 

1919 

15.2 

72.3 

80.2 

78.3 

1920 

26 5 

72 8 

77.6 

72.9 

1921 

22.2 

74.4 

79.2 

78.6 

1922 

19.3 

75.2 

77.0 

80.1 

1923 

21.7 

73.3 

79.4 

78.3 

1924 

21.7 

74.3 

75.1 

79.0 

1925 

16.6 

77.7 

79.7 

77.4 

1926 

11.0 

72.5 

78.4 

79.1 

1927 

30.0 

70.9 

76.9 

73.1 

1928 

27.0 

67.7 

78.1 

77.1 

1929 

17.5 

72.2 

78.8 

78.9 

1930 

12.0 

73.1 

81.7 

80.3 

1931 

17.5 

78.1 

80.6 

76.1 

1932 

18.5 

74.3 

81.8 

79.2 

1933 

11.5 

80.5 

81.4 

76.8 


1 The data of com yield are from Bulletin 515, U. S. D. A., and from the 
Yearbooks of the U. S. D. A. Temperature data are from reports of the U. S. 
Wither Bureau for Dodge City, Concordia and Jpla. 
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with the average June, July, and August temperatures for 
each of these years. 

The Relation between Corn Yield and Temperature: 

Preliminary Analysis 

It is known that corn yield is affected by the temperature 
during the growing season. The object of the present 
study is the determination of the precise relation between 
yield and temperature during each of the three months 
given, in order to secure a basis for estimating the yield 
from a knowledge of the temperature. As certain growing 
months are more important than others, the relation 
between temperature and yield may be determined, first, 
for each of the three months separately. 

The equation which describes the relationship between 
yield per acre and June temperature will be of the type 

Xi = a - |- &12X2. 

The equation describing the relationship between yield per 
acre and July temperature will be of the type 

Xi = a -|~ 613A3. 

(In each case Xi represents average corn yield per acre, 
for the State, while X 2 , X 3 , etc., represent the absolute 
temperature, in degrees Fahrenheit.) Instead of using to 
represent the variables the symbols Y and X, as in the 
preceding examples, X u X iy X 3 , etc., are employed, Xi 
representing in this case the dependent variable. The 
symbol for the constant representing the slope (the coefficient 
of regression) is, in the first instance above, 612 . The 
subscripts 1 and 2 indicate the variables to which this 
constant refers, the first subscript always representing the 
dependent variable ( Xi in the example cited), the second 
the independent variable (X 2 in the illustration above). 
These subscripts are necessary to distinguish the different 
constants when several variables enter into the problem. 



534 


MULTIPLE CORRELATION 


The meaning is precisely the same as in the former examples 
when no subscripts were needed because only two variables 
were dealt with. 

Solving the proper normal equations for the constants 
in the equation which describes the average relationship 
between yield per acre and June temperature, we have 

Zi = 100.35 - 1.096X 2 . 

The value of S n may be determined from the formula 
O, 2(X: 2 ) - ah(Xi) - bn S(X,I 2 ) 

if 

(The subscripts to S, and those to r which appear below, 
have the same meaning as those employed in the preceding 
paragraph.) Substituting the given values, and solving, 
we have 


and 


S* a = 33.593 
(S 12 — 5 . 80. 


The significance of the standard error, S, as a measure 
of the reliability of estimates based upon the equation of 
relationship, has been fully explained. In judging of the 
usefulness of the equation, Sn should be compared with C\ 
(the standard deviation of Xi) which may be looked upon 
as a measure of the reliability of estimates based upon the 
arithmetic mean of the variable Xi. For this we have 


<?i = 6 . 68 . 

Clearly, the estimates from the equation are more reliable 
than those based upon the mean. The coefficient of correla- 
tion, r, expresses this relationship in abstract terms. We 
may get this value from the equation 

_ aS(Xi) + bnZpCJQ - NcS 
12 S(X! 2 ) - Nci 2 

Solving for r, and giving it the sign of bn, we have 

n 2 = — .4984. 
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These values indicate a negative correlation, though not 
a high one, between yield per acre of corn and June tem- 
perature in Kansas. Let us see if the estimates could be 
improved if based upon the temperature in July instead 
of in June. 

The values needed in this study may be computed from 
Table 128. Solving for the constants in the equation of 
regression, we secure the equation 

Xi = 166.07 - I. 866 X 3 . 

For the standard error, we have 

S 13 = 4.81 

and for the coefficient of correlation 


r 13 = — .6948. 

We have here a closer relation and a better basis for 
estimates than in the case when June temperature was 
considered. 

Repeating the process for yield per acre and August 
temperature, we have 

Xi = 119.45 - I. 288 X 4 
S u = 5.78 
r 14 = — .5013. 

August temperature, it is evident, also affects the corn 
yield in Kansas, a low temperature conducing to yield 
above normal. The relationship is not so close as in the 
case of July temperature, but it is still significant. What 
is needed now is some method of combining these three 
factors, in order that an estimate may be based upon a 
knowledge of their influence, in combination, upon the 
yield of com. The addition or averaging of the temperatures 
in the three months will not do, for July is obviously more 
important than either of the other months. The principle 
of the method by which this may be accomplished is simple. 
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The Estimation op Corn Yield from Three 
Independent Variables 

The estimating or regression equation in the present case 
will be one in which there is a single dependent variable 
(corn yield) and three independent variables. It will be 
of the form 


Xi = a + 612.34X2 + 613.24X3 + 614.23X4. 

If we can determine the values of the four constants, we 
may substitute given values of X 2 , X 3 , and X 4 in the equa- 
tion and thus get an estimate for Xi in precisely the same 
way as when two variables are dealt with. The method 
of least squares affords the means of solving for the required 
constants. 

The symbols require a word of explanation, as a perfectly 
simple equation is given a rather ponderous appearance 
by all the subscripts employed. The symbol bn, it has been 
explained, represents the coefficient of regression of X x on X 2 
(i.e., the slope of the line describing their relationship, Xi 
being dependent) when these two variables alone are 
included in the study. The symbol 612.34 represents the 
coefficient of net regression of Xi on X 2 . The addition of the 
subscripts 3 and 4 to the right of the period means, simply, 
that the variables X 3 and X 4 have been included in the 
study and the effects of their variations eliminated, in so 
far as this one constant (612.34) is concerned. This constant 
measures the weight which must be given to the variable 
X 2 in an estimate of Xi based upon the three independent 
variables, X 2 , X 3 , and X 4 . It will not, of course, be the 
same as 612, which indicates the weight given to X 2 when 
an estimate of Xi is based upon X 2 alone. Similarly the 
constant 613.24, the coefficient of net regression of Xi on X 3 , 
measures the weight given to X 3 when X 2 and X 4 are also 
included. Each coefficient represents a single, simple con- 
stant, but the subscripts are necessary in order that the 
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precise meaning of this constant may be clear. The subscripts 
to the left of the period are termed primary subscripts, 
those to the right secondary subscripts. 

FORMATION AND SOLUTION OF THE NORMAL EQUATIONS 

The first task 1 is the securing of the normal equations 
required in solving for the constants in the estimating 
equation given above. Following the usual procedure 2 we 
have: 

I 2(Xi) = Na + 6«. 34 2(X 2 ) + b 13 . 2 4 Z(X 3 ) + 6„. 2 32(X 4 ) 

II Z(X 4 X 2 ) = aZ(X 2 ) + b 12 . 3i Z(X 2 *) + & 13 . 24 2(X 2 X 3 ) 

+ 6 14 .232(Z 2 X 4 ) 

III 2(XiA' 3 ) - a2(X 3 ) + 6 12 . 34 2(X 2 X 3 ) + 6 13 . 24 2(X 3 2 ) 

+ 6i 4 . 23 2(X 3 X 4 ) 

IV 2(X t X 4 ) = aZ(X 4 ) + & 12 . 34 Z(X 2 X 4 ) + 6 13 . 21 2(X 3 X 4 ) 

+ 6i4. 23 2(X 4 2 ). 

The given values might be substituted in these simul- 
taneous equations and solutions secured directly for the 
four constants. It is possible to reduce the number of 
normal equations by one, however, and thus lessen mate- 
rially the labor of computation. This is done by using 
deviations from the arithmetic mean for each variable 
instead of absolute values, getting rid in this way of the 
constant term a in the original equation. 

If we let A i, A 2 , A 3 , etc., represent the arithmetic means 
of the different variables while x lf x 2 , x 3 , etc., represent 
deviations from the means, we may replace the absolute 
numbers X h X 2> X 3) etc., by their equivalents, x t + A h 
x 2 + A 2 , x 3 + A 3 , etc. Making these substitutions in the 
normal equations, certain algebraic simplifications are pos- 

1 The approach to the problem of multiple correlation which is here taken 
follows that of H. R. Tolley and M. J. B. Ezekiel “A Method of Handling 
Multiple Correlation Problems,” Journal of the American Statistical Associa- 
tion, December, 1923, 993-1003. 

2 Cf. Appendix A for a discussion of this procedure and of the methods em- 
ployed in simplifying the normal equations. 
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sible which eliminate the first of the normal equations, 
and reduce the others to the following form: 


2(XjXj) 

N 

2(xix*) 

N 

Sfriau) 

N 


2(z 2 2 ) t , 2(xiX») t t 

O12.34 i Tf 018.24 i °14.23 


N 

2(xtXa) 

N 

2(x 2 Xi) 

N 


&12.34 + 


-' 12.34 


+ 


N 

Z(*3 2 ) h 
N 0 

SfxsXi) 

N 


+ 


+ 


N 

2(x 3 x 4 ) 

N 

S(*4 8 ) 


-14.28 


N 


b 14 


23- 


All the variables in the above equations refer to deviations 

from the respective arithmetic means. Therefore — ^ — 

is simply the mean product of the variables X\ and x 2 , 

- is <7 2 2 , etc. Representing the various mean products 

by the symbols p i2) p u , etc., and inserting the symbols 
for the squares of the standard deviations, we secure, for 
the normal equations : 


Pl2 = 02 2 &12.34 4* P23&13.24 + Ptibu.23 
Pii = P 2 3?>12.34 + 0'3 2 i>l3. 2 4 + P34&14.23 
Pl4 = P24&12.34 + P34&13.24 + 04 2 5l4.23. 

This is the most convenient form for the solution of the 
normal equations. 

From the data, as arranged in Table 128, the following 
values are derived: 


2(Xi) = 863.9 

2(X 2 ) = 3,241.5 
2(X 3 ) = 3,453.4 
2(X 4 ) = 3,409.1 


2(Xi 2 ) = 18,928.17 
2(A 2 2 ) = 239,209.57 
2(X 3 2 ) = 271,317.92 
2(X 4 2 ) = 264,433.19 


2(ZiX 2 ) = 63,198.42 
2(ZiA 3 ) = 67,295.48 
2(XiX4) = 66,550.84 
2(Z 2 X3) = 254,544.98 
2(X 2 X 4 ) = 251,246.54 
2(M4) = 267,649.61 
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Cl 


Ct 


Z(*i) 

N 

19.6341 

73.6705 


c 8 = 78.4864 
c 4 = 77.4795 


Ci 2 = 385.4979 
C2 S = 5,427.3426 
eg 2 = 6,160.1150 
c 4 2 = 6,003.0729 


From these values, the quantities necessary for the soIu j 
tion of the normal equations may be readily determined. 
These quantities are brought together below: 


<Tl 2 = —a r Ci 2 


N 

18,928.17 

44 


- 385.4979 = 44.6878 


<rj 2 = -? ? 2 9 ?— ' - - 5,427.3426 = 9.2385 

= 271 »3 17- 92 _ 6 60 n50 = 6 2Q13 
44 

<r 4 2 = 264,43 , 3 • 19 _ 6,003.0729 = 6.7723 


Pa — 


44 


N 


C1C2 


V is = 

Pl4 = 

?23 = 

P24 = 

P34 = 


63,198,4:2 - 1,446.45396 = - 10.1263 
44 


67,2 9 5 .4 8 _ 1541 0098 = _ n .5671 


44 


66,550 ^84 _ j 521.2403 = - 8.7213 
44 

254,544.98 _ 5 782i1 3 2 3 = 2.9808 
44 

251 ,946 - 54 - 5,707.9535 = 2.1951 
44 

267,949 ~ 61 - - 6,081.0870 = 1.8586. 
44 
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Substituting in the normal equations, we have: 

- 10.1263 = 9 . 23856,2 3 4 -f- 2.98086 is.24 -1- 2. 19516m m 

- 11.5671 = 2.98086i2.34 + 6.20136, 3 .24 + 1.85866,4.22 

- 8 . 7213 = 2 . 19516,2.34 + 1 . 85866 13 .24 + 6 . 77236,4.2s. 

Solving these simultaneous equations 1 we secure the fol- 
lowing values for the constants: 

612.34 ~ — 0.460 613,24 = 1.420 614,22 ~ 0.749. 

The required equation is, therefore, 

Xi — — 0.460x2 — 1.420x3 — 0.749x4. 

This is the equation of regression of Xi on x 2 , x 3 , and x 4 . 
Any given values of the three independent variables (June 
temperature, July temperature, and August temperature) 
may be substituted in this equation, and the most prob- 
able value of the dependent variable (corn yield per acre) 
determined. In the equation as it stands, it should be noted, 
all the variables are expressed as deviations from their re- 
spective arithmetic means. For practical purposes it is ad- 
visable to have an equation in terms of the original values. 
In other words, it is desirable to shift the origin from the 
point of averages to the zero point on the original scales. 
This necessitates re-introducing the constant term a. 

The value of a may be determined from the equation 

A\ = a + A2612.34 + A3613.24 + A4614.23 
where the A’s represent the respective arithmetic means. 2 
Inserting the proper values, we have 

1 Any method of solution may be employed. Perhaps the most convenient 
with three or more equations is the Doolittle method. This is explained in 
detail in Appendix A. 

2 This equation is derived from the first normal equation, as given on p. 537 , 

2(Ai) = Na + 6i2.34S(X2) + bu.u 2 (Xz) + &14.232CX4). 

Replacing the absolute numbers Xi, X 2 , etc., by their equivalents x x + Ai, 
x% + A 2, etc., we secure 

2(xi) + NA\ *= Na + &i2.34[2(z2) + NA2] + &i 3 . 24 [ 20 rs) + ATAj] 

+ &i4.2s[S(a;4) + ATAJ. 

Since 2(a:i) = 0, Sfo) =* 0 , etc., these values disappear. Dividing through by 
N we obtain the equation presented above. 
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19.6341 = a + 73 . 6705( — 0.46005) + 78.4864 (- 1.41967) 

+ 77.4795(— 0.74910). 1 

Solving, 

a = 222.99. 

The equation of regression in terms of original values is, 
therefore, 

X, = 222.99 - 0 . 460X, - 1.420X 3 - 0.749X 4 . 

COMPUTATION OP THE STANDARD ERROR OF ESTIMATE 

Are estimates based upon this equation any more reliable 
than those based upon the equations previously derived, 
each of which referred to a single independent variable? 
To answer this question the value of the standard error 
must be computed. This will be represented in the present 
case by Si. 234 , the subscripts referring to the single dependent 
variable (Xi) and the three independent variables. This 
value may be computed from the formula 2 

1 The arbitrary origin is at zero on each of the original scales, hence A\ = Ci, 
A 2 = c 2 , etc. To ensure greater accuracy in solving for a, the values of the co- 
efficients 612.34, 613.24, etc., are given to a greater number of decimal places than 
in the equation of regression. 

2 This formula may be derived as follows: Given an equation of the type 

X \ = 612.34^2 + 613.24^3 + 614.23^4 

(in which the variables refer to deviations from the means) each residual may 
be computed from the equation 

d — 612.34^2 + 613.24X3 + 614. 23-E4 #1* (X) 

Multiplying throughout by d } and adding, we have 

2(d 2 ) = 612.342(^X2) + 613.242(^X3) + 614.232(^X4) — 2(dfx 1) 
but it follows from the method of fitting that 

S (dx 2 ) = 0 
2(dx 3 ) = 0 
2 (dx A ) = 0 

and, therefore, 2 (d 2 ) =* — S(dxi). (2) 

Multiplying each residual equation (1) by xi and adding, we have 
X(dx i) = 612.342 (X1X2) + 613.242 (X1X3) + 61 4 . 232 (X1X4) — 2(xi 2 ). 

Substituting the equivalent of 2(dxi) in equation (2) we secure 

( Footnote continued on page 
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5*1.234 = <Ti 2 — bi2MPl2 ~ &13.24P1S — buMPu. 

Substituting the proper values, we have 

£ 2 1.234 = 44.6878 - 4.6586 - 16.4215 - 6.5331 
= 17.0746 
54.234 = 4. 13. 1 

This is to be interpreted just as the standard error of estimate 
was interpreted in previous cases. The reliability of estimates 
based upon the mean value of Xi is measured by cri, which 
has a value of 6.68. The reliability of estimates based 
upon the equation of net regression, when yield is considered 
as a function of temperature in June, July, and August, is 
measured by 5i. 2 34 which has a value of 4 . 13. It is clear 
that estimates made from the equation are distinctly more 
reliable than those based upon a knowledge of Xi alone. 
We have by no means accounted for all the factors that 
are responsible for variability in corn yield, but we have 
measured and reduced to precise terms the effect of three 
factors upon the yield of corn per acre in Kansas. 

( Footnote 2 continued from page 5^1.) 

2 (d 2 ) = 2 (xi 2 ) — 612 . 342 (^ 1 X 2 ) — 6i3.242(x!X 8 ) — &i4.2a2(xix 4 ) 

2(d 2 ) 2(xi 2 ) Z(xix 2 ) 2(xix.) . 2 ( 2184 ) 

8 l Ui ~N~ = "N ~ 

Since the variables refer to deviations from the means, we have 

£ 2 1.234 — <7i 2 — 612 . 34 P 12 ~ 613 . 24 P 13 ~ &14.23P14* 

See Appendix A for a general derivation of these relations. 

1 For precise work, when the sample is small, allowance should be made in 
computing S for the number of constants in the equation of regression. Since 
there are four constants in the present equation, the 44 observations have but 
40 degrees of freedom to deviate from the computed values. Denoting by S 
the corrected value of the standard error of estimate, and by m the number of 
constants in the equation of regression, Ezekiel gives 



applying this correction to the present measurements, we have 

S’„ s4 = 17.074 6 ( 4T ^i- 4 ) 

= 18.782 
$i.2«4 = 4.33. 
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THE COEFFICIENT OF MULTIPLE CORRELATION 


We have need now of our third measure, the abstract 
coefficient of correlation. The value of this coefficient, as 
we have seen, depends upon the relation between S and a. 
It may be computed in the present instance from the 
formula 

D, _ i 5*1.234 

it 1.234 — i a~~ 9 

01 

When the relationship between a single dependent variable 
and several independent variables is being studied, this 
measure is termed the coefficient of multiple correlation 
and is represented by the symbol R. The subscript to 
the left of the period relates to the dependent variable, 
while those to the right relate to the independent variables. 
Substituting in this formula the equivalent of $* 1 , 234 , we have 

, ffj 2 — & 12 . 34 P 12 — bu.uPli — 2 > 14 . 2 sPl 4 

It 1.234 = I ~n 

Vl 


which reduces to 1 


■R*1.234 = 


&12.34P12 ~t~ l>13.24Pl3 + bltZiPu 

<T 1* 


Inserting the proper values we have 

4 . 6586 + 16.4215 + 6. 5331 
R 1 234 ~ 44.6878 

fi*1.234 = .6179 
fil.334 = - 786. 

For the same reason that estimates of p computed from 
samples must be corrected by making allowance for the 
number of constants in the regression equation, correction 


1 The coefficient of multiple correlation may also be derived from the general 
formula, which refers to an origin at zero on the original scales. This general 
formula is 
fi* 1 . 1 a . . . n 

oS(Xi)+6u.m . • . . . • wS( 3LiX))~Hin,ii . . . wS(XiXi)~H . . . Net 2 

= 2(Xi) a - AW 
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must be made in R. For if the number of constants is 
equal to the number of observations, R will necessarily 
equal 1. Using R to denote the corrected coefficient of 
multiple correlation and m to denote the number of con- 
stants in the equation of regression, Ezekiel gives 

In the present example 

R* = l- {(1- - 617 9)(H^i)} 

__ = .5892 

R = .768. 

In later references to this illustration the uncorrected 
measure is used, though it is to be understood that the 
corrected measure provides a somewhat closer approximation 
to the true R than does the uncorrected coefficient. 

The coefficient of multiple correlation is an index of the 
degree of relationship between a single dependent variable 
and a number of independent variables, in combination. 
It measures the degree to which variations in the dependent 
variable are related to the combined action of the other 
factors. Its significance may be clearer if all the independent 
variables are looked upon as constituting a single independ- 
ent series. The coefficient is then seen to be a measure 
of the relationship between the dependent variable and the 
independent series, which is precisely what the coefficient 
of correlation is in the simpler case of two variables. In 
the multiple case the independent series has several com- 
ponent elements, but this fact does not alter the essential 
significance of the coefficient. No positive or negative sign 
is attached to R, it should be noted. In the present instance 
all of the independent variables are negatively correlated 
with corn yield, and a negative sign might be attached. 
The correlation could be positive, however, for some of 
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the independent variables, and negative for others. Because 
of this fact, R is always given without sign. The signs of 
the constants in the equation of net regression show which 
of the independent variables are positively correlated and 
which are negatively correlated with the dependent variable. 

The sampling error of the coefficient of multiple correlation 
may be estimated from the formula 


cr 


1 - fl 2 
■s/N — m 


where m is the number of constants in the equation of 
regression. A more accurate test of the significance of R 
may be applied with reference to Fisher’s 2 -table, discussed 
in Chapter XV. The deviations of actual from computed 
values serve as a yardstick for testing the variability in 
Xi that is attributable to X 2 , X 3) and X it as the relationship 
is defined by the equation of regression. In common with 
other correlation problems, this one reduces to a comparison 
of variances. 

The sum of the squares of the deviations of the observed 
values of X x from the computed values is 751.2824. The 
sum of the squares of the deviations of the computed 
values of Xi from the mean value of Xi is 1,214.9808. 
Since there are 44 observations, and since the equation 
of regression contains four constants, there are 40 degrees 
of freedom in the deviations from the regression function. 
The three coefficients of regression (other than the con- 
stant a) give three degrees of freedom to variation among the 
computed values of X u The test takes the following form. 


Nature of variability 

Variation among computed 
values 

Deviation of observed from 
computed values 


Degrees of 
freedom 


Sum of 
squared 
deviations 


Mean 

square Log , <r* 

<r s 


3 1,214.9808 404.9936 6.0039 

40 751.2824 18.7821 2.9329 

43 1,906 . 2632 Difference = 3.0710 

«■= 1.5355 
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For Wi = 3, n 2 = 40, the 1 per cent value of z, as derived 
from Appendix Table VI, is .7308. The present value is 
greatly in excess of this. The variation in Xi attributable 
to the influence of X 2 , X 3 , and X t is clearly greater than 
the residual variability here used as the yardstick. The 
measure of correlation, R, is unquestionably significant. 


COMPARISON OF MEASURES OF RELATIONSHIP 

The degree to which our knowledge of the causes of 
variation in corn yield has been improved and the reliability 
of our estimates increased by taking account of the various 
factors in combination may be more readily appreciated 
if we bring together the various measures secured in the 
course of this analysis. 


Table 129 


A Comparison of Certain Measures Pertaining to the Com 
Yield in Kansas 


Measure of 

Basis of estimate reliability 

of estimate 


Arithmetic mean of Xi = 19.63 <r t = 6.68 

Xi = 100 35 - 1.096X 2 S u = 5.80 

Xi = 166.07 - 1 . 866X3 S a = 4.81 

Xi = 119.45 - I.288X4 Su = 5.78 

Xi = 222.99 - 0.460X 2 - 1.420X, 

- 0.749X4 Si. 2i4 = 4.13 


Coefficient 

of 

correlation 


ru = - .4984 
rit = - .6948 
r I4 = - .5013 

Sum = .7861 


The value of S might be further reduced and the value 
of R correspondingly increased by bringing into the analysis 
other factors, such as rainfall during the growing months. 
The method which has been explained may be extended 
to cover any number of variables, one equation being added 
to the set of simultaneous equations for each additional 
variable introduced. 
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THE METHOD OF MULTIPLE CORRELATION VALID FOR UNEAR 

RELATIONSHIPS 

One important condition has not been emphasized in the 
course of the preceding discussion. The validity of this 
method of multiple correlation depends upon the existence 
of a linear relationship between each pair of variables. 
Thus with four variables there were six pairings possible 
(i.e., six mean products were computed). If there had been 
a material departure from linearity in any of these six 
relationships the significance of the results would have been 
decreased. There would be no fallacy involved in the use 
of the equation under these conditions, but it would not 
furnish as good a basis for estimates as one which took 
account of the true relationship. In such a case the values 
of & and R would indicate that the estimates based upon 
the assumption of linear relationship were not very reliable. 1 


AN APPUCATION OF THE METHOD 

Let us illustrate the use of the estimating equation. 
In the year 1933 the average June temperature in Kansas 
was 80.5° F., the average July temperature was 81.4° F., 
and the average August temperature was 76.8° F. What 
was the probable corn yield per acre? Substituting these 
values for X 2 , X 3 , and X t in the equation, 

Xi = 222.99 - 0 . 460X2 - 1 . 420X 3 - 0.749X, 
we have 

Xi = 222.99 - (0.460 X 80.5) - (1.420 X 81.4) 

- (0.749 X 76.8) 

Xt = 12.85. 

The estimated yield for 1933 is thus 12 . 85 bushels per acre. 

1 An approach to problems of multiple correlation when the relationships 
among the variables are non-linear is explained by M. J. B. Ezekiel in the 
Journal of the American Statistical Association , Vol. XIX, N. S. No. 148, 1924, 
and in his book Methods of Correlation Analysis . 
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What are the limits within which we may expect the 
actual yield to fall, with respect to this estimate? The value 
of /Si . M 4 is 4.13 bushels. This means that the odds are 
68 out of 100 that the actual yield will be within the 
limits 8.72 bushels (i.e., 12.85 — 4.13) and 16.98 bushels 
(i.e., 12.85 +4.13). The actual yield in 1933 was 11.5 
bushels per acre. 

In this illustration we have used one of the years in- 
cluded in the study. The same method would be employed 
in making an estimate for a future year. (Additional ele- 
ments of uncertainty are introduced, of course, whenever 
results secured for one period are applied to another time 
period.) Thus, from the temperatures in 1936 (76.7° in 
June, 85.5° in July, and 84.4° in August), an estimate 
of 3 . 1 bushels per acre is yielded by the regression equation 
employed above. This was a summer of exceptional heat 
and drought. The actual yield was 4 . 0 bushels per acre. 

The Meaning of Partial or Net Correlation 

In the preceding section we have sought to determine 
the degree to which corn yield in Kansas is affected by the 
temperature in June, July, and August, treating the three 
independent variables in combination. Our aim has been 
to measure their combined effect upon corn yield. There 
is a related problem, which in many studies may be of 
major importance. This is the determination of the rela- 
tionship between a dependent variable and a single indepen- 
dent variable when all other factors included in the study are 
held constant. Concretely, what would be the effect upon 
corn yield of variations in July temperature, if June tempera- 
ture and August temperature could be held constant? This 
is the problem of net or ■partial correlation. 

It is obvious that if a method could be developed by 
which two variables could be isolated for separate study, it 
would add immeasurably to the analytical powers of the 
economist, and of social scientists in general. It would give 
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to the student in these fields that power to eliminate irrele- 
vant influences and to concentrate his attention upon a single 
factor which is possessed by the chemist, for example. In 
studying the effect of one element upon another the chemist 
seeks to eliminate all other elements, and the effectiveness of 
his analysis depends in large part upon the degree to which it 
is possible thus to isolate the object of immediate interest. 

It is not generally possible in economic analysis to 
eliminate all but one of the factors responsible for variations 
in a given series. The direct and indirect causes of a given 
economic phenomenon are too numerous and too complicated 
in their interaction for the economist ever to hope to emulate 
the chemist in reducing his problem to terms of but two 
variables. But, within certain limits, the statistician is 
able to employ the method of the physical scientist in 
holding constant certain factors while the effects of varia- 
tions in another are studied. The methods which make this 
possible are among the most powerful of the instruments 
which the student of the social sciences possesses. 

The method of partial correlation may be explained with 
reference to the problem of corn yield in Kansas. Our 
object is to determine the net correlation between corn 
yield and the temperature in each of the three months 
for which the average temperature is given. 

DISTINCTION BETWEEN PARTIAL AND SIMPLE CORRELATION 

It is im portant, to distinguish between this problem and 
that faced in the ordinary measurement of relationship 
between two variables. We have already secured, as a 
description of the average relationship between corn yield 
and July temperature, the equation 

Xi = 166.07 - I.866X3 

with 


and 


jSu = 4.81 
rjj = — . 6948. 
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These measures describe the relationship in question when 
all other factors are ignored. They are not taken account 
of. They are merely neglected. It is as though the chemist, 
in studying the reaction of one element to another, used 
a test tube containing various impurities, which he made 
no attempt to remove. The economist cannot, in general, 
locate and remove all the “impurities” in his problem, but 
he should recognize that his measures relate to such 
uncorrected data. 

The Method of Partial Correlation 

In seeking to determine the net correlation between com 
yield and July temperature we attempt to secure a measure 
of the correlation which would prevail if other factors might 
be held constant. We shall take full account of the other 
factors we have studied, but we shall try to secure a meas- 
ure influenced only by fluctuations in July temperature, in 
relation to corn yield. 

One possible method of accomplishing this end may be 
suggested. If one possessed data covering a very long 
period we might be able to pick out a number of years 
during which the average temperatures in June and August 
remained unchanged. Let us say that we could find thirty 
years in all, during each of which the June temperature 
averaged 74° and the August temperature 78°. Corn yield 
and July temperature varied during these years. The re- 
lationship between July temperature and corn yield might 
now be measured, and it would be certain that the results 
would not be affected by the presence of fluctuations in 
June temperature and August temperature. Unfortunately, 
this method of holding certain factors constant cannot be 
employed. The data are too limited and too varied, in 
general, to enable us to pick from among them such figures 
as are appropriate to our purpose. Other methods of 
arriving at the same end are available, however. 

As a first step, let us derive the equation defining the 
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relationship between corn yield as dependent variable and 
June temperature and August temperature as independent 
variables. This will be of the form 

Xi = a + 612.4X2 + 614.2X4. 

We solve for the constants exactly as in the preceding 
example, except that variables X h X 2 , and X 4 only are; 
employed. The desired equation is 

Xi = 160.97 - 0.856X 2 - 1.010X 4 . 

We may determine the value of the standard error of 
estimate from the relation 

$ 2 1.24 = CTi 2 — &12.4P12 — 614.2P14. 

We secure 

5*1.24 = 27.2112 
5i.24 = 5.22. 

If corn yield per acre is estimated from June temperature 
and August temperature the standard error of estimate, 
or the standard deviation of the remaining variability, is 
5.22 bushels. But we know that if corn yield is estimated 
from June, July, and August temperature, the standard 
error of estimate, or the standard deviation of the remaining 
variability, is 4.13 bushels. The measure of remaining or 
“unexplained” variability is reduced from 5.22 to 4.13 
by the addition of July temperature (X 3 ) to the estimating 
equation, after account has already been taken of the 
influence of June temperature (X 2 ) and August temperature 
(X 4 ). The difference between these two measures may be 
taken to represent a relationship between Xi and X 3 which 
is not affected by variations in X 2 and X 4 . 

We have seen that the degree of correlation between a 
dependent variable (Xi) and an independent variable (X 3 ) 
may be defined by the relation 

5*13 
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Here the denominator of the fraction in the right-hand 
member defines the original variability of Xi, while the 
numerator of that fraction defines the variability of Xi 
after account has been taken of the influence of X*. In 
the present problem we have 


r 


*» = 


23.1134 

44.6878 


fu = — .695. 


.4828 


The coefficient of correlation is given the sign of b u , the 
coefficient of regression. 

In exactly the same way, we may say that the net 
correlation between Xi and X 3 , when the relationship is 
not affected by fluctuations in X 2 and X 4 , is defined by the 
relation 


r 2 18.24 = 1 — 


< 5 * 1.234 

$* 1.24 ‘ 


Here the denominator of the fraction in the right-hand 
member defines the variability remaining in Xi after account 
has been taken of the influence of X 2 and X 4 , while the 
numerator defines the variability remaining in X 4 after 
account has been taken of the influence of X 2 , X 3 , and X 4 . 
Numerator and denominator differ only because of the presence 
of correlation between Xi and X 2 that is incremental to any 
correlation that may exist between Xi on the one hand and 
X 2 and X 4 on the other. If the equation 

Xi = 222.99 - 0.460X 2 - 1.420X 3 - 0.749X 4 


gives estimates no more reliable than those derived from 
the equation 

Xi = 160.97 - 0 . 856Xj - I.OIOX 4 

then numerator ($* 1 , 234 ) and denominator ($* 1 . 24 ) of the above 
fraction will be equal, and r*i 3 .24 will have a value of zero. 
But if the equation containing X 2 , X 3 , and X 4 as independent 
variables gives better estimates than does the equation 
containing only X 2 and X 4 , the numerator will be smaller 
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than the denominator, and r 2 i 3 .24 will have a value other 
than zero. If the estimates based upon the three independ- 
ent variables are in exact agreement with observed yields, 
S*i .234 will be equal to zero, and r 2 i 2 . 3 4 will have a value of 
unity. 

Employing the values derived above, we have 


13.84 = 1 — 


17.0746 

27.2112 


= .3725 


7*13.24 — — -610. 


The coefficient of net correlation, r 13 . 24 , is negative, having 
the same sign as the coefficient of net regression, 613 . 24 . 

The quantity 7 * 13.24 measures the degree of correlation 
between Xi and X 3 when neither one is affected by variations 
in Xi and X 4 . It may be thought of, equally well, as a 
measure of the degree to which errors ii| estimating Xi 
are reduced when use is made of X 3 , after full account has 
already been taken of the influence of X 2 and X 4 on Xi. 

The meaning of the symbols employed' in the above 
demonstration should be clear from the context. As in 
the coefficients of net regression, the first of the subscripts 
to the left of the point (the primary subscripts) refers to 
the dependent variable; the second of the primary sub- 
scripts refers to the single independent variable to which 
the measure of net correlation applies specifically. The 
subscripts to the right of the point (the secondary sub- 
scripts) indicate the variables which are held constant for 
the purpose of the particular comparison being made. The 
number held constant is two in the present case, though 
it might be one, or any other number. Thus the general 
formula for the coefficient of net correlation between vari- 
ables Xi and X* would be 


r*13.24M 


S \ .23456 
<S S 1.246« 


n 


n 


The variable that is present in the numerator and absent 
in the denominator is the particular independent variable 
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that is being paired with the dependent variable lor the 
purpose of measuring net relationship. 

The coefficients of net correlation between Xi and each 
of the other independent variables may be derived in similar 
fashion. Thus 


r 2 i2.»4 = 1 


<S 2 1.4«4 

<S 2 1.84 


r 2 14.S3 


1 _ 

S* 1.23 


In each case the difference between numerator and denomi- 
nator of the fraction in the right-hand member measures 
the net reduction in the variability of Xi which is associated 
with a relationship between X x and a single independent 
variable, account having already been taken of the influence 
of two other variables. 

It is clear that such measurements as these are net only 
with respect to the variables represented by the secondary 
subscripts. The coefficient r 12,34 measures the degree of 
relation between Xi and X 2 when X 3 and X 4 are held con- 
stant. There may be many other factors affecting Xi and 
Xt) the disturbing influences of such factors have not been 
eliminated. These other factors still muddy the water of 
analysis. Ignoring them is not the same as holding them 
constant. Only by direct measurement and inclusion in 
the study, as was done with X 3 and X 4 , may the influence 
of additional variables be effectively eliminated. 


Another Method of Computing Coefficients of 
Partial Correlation 

Obviously a whole series of coefficients of net correlation 
may be computed in dealing with a number of variables. 
In deriving a number of such measurements a method may 
be utilized which differs somewhat from that employed 
above, and which has certain advantages in the way of 
systematic arrangement. 
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A simple coefficient of correlation relating to but two 
variables is termed a coefficient of zero order . Such coefficients 
are represented by symbols of the type r 12 , r 24 , etc. Coeffi- 
cients of net correlation which relate to two variables, 
while a single additional variable is held constant, are 
termed coefficients of the first order, and are represented by 
symbols such as ri 2 . 3 , r 24 . 3 , etc. Similarly, we may have coeffi- 
cients of the second, third, fourth, or nth order, depending 
upon the number of variables held constant while the relation- 
ship between a single dependent and a single independent 
variable is being measured. 

It is possible to derive each coefficient of partial correla- 
tion from those of the next lower order. Thus a coefficient 
of the first order may be derived from the relation 

_ r 12 — ris • r 23 

W ' 3 ~ (1 - r\ 3 )i (1 - r-’ 23 )»‘ 

For a coefficient of the second order 

_ 7*12.3 “ 7*14.3 * 7*24.3 

1234 ” (1 - rVa)* (1 - r 2 2 4. 3 ) J ' 

As a general equation for a coefficient of net correlation 
of any order , 1 we have 

_ _ ri 2 . 3 45 . . . („_1) — r 1„. 3 46 . . . ( n _i) • r 2n . 3 4 3 . . . („_1) 

* 12.345 . . . n /-j o \i /■* o \i * 

U — 7*^i n .345 . . . (n-l)P U 7*^ 2 n.345 . . . (w-ld* 

Thus it is possible, starting with the zero order coefficients 
of correlation, to compute all higher order coefficients 
successively. The mere arithmetic of calculation would be 
laborious, but certain prepared tables reduce these computa- 

1 It will be noted that in an equation used in computing a coefficient of 
partial correlation the three r’s in the numerator of the right-hand member 
have the same secondary subscripts, and that these secondary subscripts are 
one less in number than the secondary subscripts of the left-hand member; 
that the first r in the numerator has the same primary subscripts as the left 
hand member; that the second and third r's in the numerator have primary 
subscripts composed of one of the primary subscripts of the left-hand member 
plus the missing secondary subscript; that the two r’s in the denoninator are 
the same as the second and third r’s in the numerator. 
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tions to a minimum . 1 The method may be illustrated, using 
the data of the preceding problem. 

In the present case we require three coefficients of the 
second order, r 12 . 34 , r*i 3 . 2 4, and r u .23. These will serve as 
measures of the net correlation between corn yield and 
temperature in each of the three critical months. The 
formula from which the first of these measures may be 
computed was given above. For the second, we have 

7*13.2 — 7*14.2 ' 7*34.2 

ri3 ' 24 ~ (1 - rV *)* (1 - 
and for the third 

7*14.2 “ 7*13.2 * 7" 43.2 

7*!4.23 - (1 _ r 2 13 j) | (1 _ r 2 43 s)} * 

But each of these values may be derived from a slightly 
different grouping of first order coefficients. We may use 
the three formulas 

7 * 12.4 — 7 * 13.4 • Tt3A 
rU * 34 “ (1 - 7- J 13.4)* (1 - 7^3.4)* 

_ 7 * 13.4 — 7 * 12.4 * 7 * 32.4 

ri3 * 24 "" (1 - 7 * 2 12 . 4 )» (1 - 7 - 2 32 . 4 )* 

__ 7*14.3 “ 7*12.3 * 7*42.3 

ri4 - 23 ” (1 - r 2 (1 - 7- 2 42.3)»' 

By employing both methods in computing each second 
order coefficient a check upon the calculations is afforded. 

COMPUTATION OF FIRST ORDER COEFFICIENTS 

The second order coefficients cannot be computed until 
all necessary first order coefficients have been secured. 
The necessary equations, of the type 

7*12 ~ 7*13 • T23 

r “* 3 " (1 - t* 2 13)* (1 - r 2 23 )* ’ 

may be constructed from the general formula for coefficients 
of partial correlation. Since several of these values must 
be computed, a systematic arrangement should be employed. 

1 J. R. Miner, Tables of Vl — r 2 ami 1 — r 2 /or use in Partial Correlation and 
in Trigonometry , Johns Hopkins Press, Baltimore, Md., 1922. 
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Table 130 

illustrating the Computation of First Order Coefficients of Partial 

Correlation 


(Kansas com yield and temperature) 


r 0 Order 


Product 

Whole 

numerator 

Denom- 

inator 

r 1st Order 

Sub- 

script 

Coef- 

ficient 

(1 - rrf 

term of 
numerator 

Sub- 

script 

Coef- 

ficient 

12 

- .4984 


- .2736 

- .2248 

.6611 

12.3 

- .3400 

13 

- .6948 

.7192 






23 

+ .3938 

.9192 






14 

- .5013 


- .1993 

- .3020 

.6890 

14.3 

- .4383 

13 

- .6948 

.7192 






43 

+ .2868 

.9580 






24 

+ .2775 


+ .1129 

+ . 1646 

.8806 

24.3 

+ .1869 

23 

+ .3938 

.9192 






43 

+ .2868 

.9580 






13 

- .6948 


- . 1963 

- .4985 

.7969 

13.2 

- .6255 

12 

- .4984 

.8669 






32 

+ .3938 

.9192 






14 

- .5013 


- .1383 

- .3630 

.8329 

14.2 

- .4358 

12 

- .4984 

.8670 






42 

+ .2775 

.9607 






34 

+ .2868 


+ .1093 

+ .1775 

.8831 

34.2 

+ .2010 

32 

+ .3938 

.9192 






42 

+ .2775 

.9607 






12 

- .4984 


- .1391 

- .3593 

.8313 

12.4 

- .4322 

14 

- .5013 

.8653 






24 

+ .2775 

.9607 






13 

- .6948 


- . 1438 

- .5510 

.8290 

13.4 

- .6647 

14 

- .5013 

.8653 






34 

+ .2868 

.9580 






23 

+ .3938 


+ .0796 

+ .3142 

.9204 

23.4 

+ .3414 

24 

+ .2775 

.9607 






34 

+ .2868 

.9580 
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The procedure in computing each first order coefficient 
is simple. Three zero order coefficients are necessary for 
each calculation. These should be arranged in the table 
in the order in which they occur in the numerator of the 
fraction from which the required coefficient is to be com- 
puted. The numerator of this fraction is secured by sub- 
tracting from the first zero order coefficient the product of 
the other two. This product term appears in one column 
of the table. The denomina tor of t he fraction is the product 
of two terms of the type Vl — r 2 , derived from the second 
and third coefficients in each group of three. The tabular 
arrangement of Table 130 on page 557 permits these com- 
putations to be carried forward systematically. 

The coefficient r 23 .4 is, of course, identical with r 82 . 4 ; 
r !4 .2 is identical with 7*43.2, etc. It is unnecessary to duplicate 
the work of computation with respect to these measures. 

COMPUTATION OF SECOND ORDER COEFFICIENTS 

From these first order coefficients the three required 
second order coefficients may be secured by methods analo- 
gous to those employed above. The computations are 
shown in Table 131. As a check upon the calculations each 
required measure is computed from two different combina- 
tions of the first order coefficients. 

The value of rj 3 . 2 4 , it will be noted, is the same as that 
derived from the relation between £i . 24 and $ 1 , 234 . 

The meaning of such coefficients as these was explained 
in the earlier section dealing with this problem. The follow- 
ing summary of results reveals the gain in knowledge which 
has resulted from the above analysis. 

7*12 = — .4984 7*12.34 = — .2923 

7*13 = — .6948 7 * 13,24 = — .6101 

7*14 = — .5013 7 * 14.23 = — .4057 

It is clear that the net effect of June temperature upon 
corn yield is distinctly less than was indicated by the simple 
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Table 131 

Illustrating the Computation of Second Order Coefficients of Partial 

Correlation 

(Kansas com yield and temperature) 


r 1st Order 


Product 

Whole 

numerator 

Denom- 

inator 

r 2nd Order 

Sub- 

script 

Coef- 

ficient 

(1 - r 2 )* 

term of 
numerator 

Sub- 

script 

Coef- 

ficient 

12.3 

- .3400 


- .0819 

- .2581 

.8830 

12.34 

- .2923 

14.3 

- .4383 

.8988 






24.3 

+ . 1869 

.9824 






13.2 

- .6255 


- .0876 

- .5379 

.8816 

13.24 

- .6101 

14.2 

- .4358 

.9000 






34.2 

+ .2010 

.9796 






14.2 

- .4358 


- .1257 

- .3101 

.7643 

14.23 

- .4057 

13.2 

- .6255 

.7802 






43.2 

+ .2010 

.9796 






12.4 

- .4322 


- .2269 

- .2053 

.7022 

12.34 

- .2924 

13.4 

- .6647 

.7471 






23.4 

+ .3414 

.9399 






13.4 

- .6647 


- .1476 

- .5171 

.8476 

13.24 

- .6101 

12.4 

- .4322 

.9018 






32.4 

+ .3414 

.9399 






14.3 

- .4383 


- .0635 

- .3748 

.9238 

14.23 

- .4057 

12.3 

- .3400 

.9404 






42.3 

+ . 1869 

.9824 







correlation. This is so because there is a positive correlation 
between temperature in June and temperature in July and 
August, so that the crude correlation of two variables 
alone shows June temperature as more important than it 
really is. For the same reason, all the net coefficients are 
less than the simple coefficients, though it is still apparent 
that July temperature is far more important, in relation 
to corn yield, than the temperature in either of the other 
months. 
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The coefficients of net correlation are net, of course, 
only with respect to the variables actually taken account 
of, and held constant. Thus there may be other factors, 
such as rainfall in June, July, or August, which affect 
corn yield and which are correlated with the temperature 
during these months. Were these included the various 
coefficients of net correlation might have different values 
from those given. 

The sampling error of a coefficient of partial correlation 
may be estimated from the same general relations that 
hold for zero order coefficients, except that the factor N — 1 
must be further reduced by the number of variables held 
constant. Thus for ri 2 . 34 we have 

_ 1 - r 2 u . 34 

<rr "“ VA— ~3* 

A MEASURE OF VARIABILITY 

Having these coefficients of net correlation, another 
measure of some importance may be computed. This is 
a measure of the variability of a single character while a 
number of related variables are held constant. Thus the 
question might arise: If we could hold constant the tem- 
perature in Kansas in June, July, and August, what would 
be the variability of the corn yield? In other words: If we 
could eliminate such variability in corn yield as is due to 
variability in temperature, what fluctuations would remain 
in the yield of corn? This measure of variability is repre- 
sented by the symbol <ri. 234 . . . „. It is termed the standard 
deviation of order n. 

This measure may be computed from the general equation 

<r 2 ui . . . » = ori 2 (l - r 2 u)(l - r 2 i 3 . 2 )(l - r 2 i 4 . 23 ) . . . 

(1 r 2 i». 23 , . , n— l). 

Applying this formula to the results of the study of corn 
yield, we have 
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<r 2 u34 = 44 . 6878[1 -(- .4984) 2 ][1 -(- .6255) 2 ][1 -(- .4057)*] 

<t 2 i.234 = 17.0797 

(7l,234 ^ 4 . 13. 

Referring back to the discussion of this problem we find 
that the values of 0-1.234 and $1,234 are identical. That is, 
the standard deviation of variable Xi, when variables X t , 
X3, and Xi are held constant, is merely the standard devia- 
tion of observed values from computed values of X\. It 
is the standard error of estimate, when estimates are based 
upon the factors X 2) X 3 , and X 4 . The reason for this is 
obvious. The variability of the original series is reduced 
to the extent that estimates based upon the equation of 
relationship approximate the actual values. The variability 
which remains is due to differences between these estimates 
and the actual values. But these differences are merely 
the residual deviations, from which $ is computed. A re- 
alization of the identity of these two measures may assist 
in making their meaning clear. 

Since 0-1.234 and $1,234 are identical, the coefficient of 
multiple correlation, R1.234, may be computed from the 
equation 


or, using the formula for o- 2 i. S 3 4 • • • n, from the equation 
1 - Rh.23 . . . » = (1 - r 2 i 2 )(l - r 2 13 . 2 )(l - r 2 l 4 . 23 ) . . . 

(1 r 2 in.23 . . . (n— l) • 


BETA COEFFICIENTS 

The several coefficients of regression in an equation of 
multiple regression are, in effect, weights applied to the 
different independent variables in estimating the successive 
values of the dependent variable. Usually these coefficients 
of regression are not comparable, because the independent 
factors are expressed in different units, or because they 
differ in variability. It is often desirable to reduce the 
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coefficients of regression to comparable terms. This may 
be done by expressing dependent and independent variables 
alike in units of their respective standard deviations. The 
coefficients of regression are then called beta coefficients, 
and are represented by the symbols 012.34, 1813.24, etc. 

In terms of a simple two-variable problem, we have 

Xi ■= 613X3. 

If we change to standard deviation units we must divide 
both sides of the equation by <r\ and by <r 3 . This gives 



The desired Beta coefficient is, then, 

ft* = b n (^y 

For the corn yield example, we have 

tk.--l.axfe. |)-- 696 . 

This may be taken to mean that with an increase of one 
standard deviation in X 3 (July temperature), the yield of 
corn decreased . 696 of one standard deviation. 

These measurements are particularly useful in analyses 
involving more than two variables. Here the relationships 
between the beta coefficients and the coefficients of net 
regression are similar to those indicated for the two- 
variable problem. Thus 
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Substituting the required values in these equations, we have 

/3i2.34 = — .209 

— — .529 

|5l4.23 = — .292. 

The second of these coefficients may be taken to mean that 
with an increase of one standard deviation in July tempera- 
ture, when June and August temperatures are held constant, 
corn yield decreases .529 of one standard deviation. The 
other coefficients have similar meanings. 

The beta coefficients relate to factors expressed in com- 
parable units and similar in respect of variability. A 
fluctuation of one standard deviation in X 2 may be taken 
to be equal to a fluctuation of one standard deviation in X 3 . 
The coefficients defining the changes in Xi that are likely 
to accompany these equal movements in X 2 and X a have 
obvious significance. 

CERTAIN LIMITATIONS 

The measures we have described in dealing with problems 
of multiple and partial correlation are valid on the assump- 
tion that the relationships among the different variables 
are in all cases linear. Thus with four variables six different 
pairs may be obtained. The regression in each of these 
six cases should be linear if combined or net effects are to 
be studied by the methods outlined above. If the regression 
is non-linear when natural numbers are dealt with, it may 
be possible to secure linear relationships by correlating 
logarithms or reciprocals. Thus we might derive an esti- 
mating equation of the type 

Log Xi = a + & 12.3 4^L2 + 513.24^3 + & 14 . 23 A 4 

if the relation between Xi in logarithmic form and each 
of the other variables in the original arithmetic form were 
linear . The corresponding measures, S and 12, would then 
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relate to ratios, as in the examples given in the following 
chapter. 1 

One other important limitation should be noted. Coeffi- 
cients of multiple or of net correlation based upon a large 
number of variables have little significance unless the num- 
ber of observations be large. Misleadingly high values will 
be secured when studies involving many variables are based 
upon small samples. (Application of the corrections referred 
to in the text will prevent misinterpretation, in such cases.) 
Within the limits set by these restrictions, the methods of 
multiple and partial correlation constitute very powerful 
instruments of economic analysis. 
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CHAPTER XVII 


THE MEASUREMENT OF RELATIONSHIP AND 
THE PROBLEM OF ESTIMATION 

It is no great exaggeration to say that quantitative 
method in economics and business centers about the prob- 
lem of estimation. Equations of regression, measures of 
standard error and coefficients of correlation are of interest 
largely because of their bearing upon the practical problems 
of determining probable production, probable price, probable 
business changes. It should not be understood from this 
that the problem of estimation relates only to attempts to 
forecast future changes. We make an estimate whenever 
we seek to determine the most probable value from a 
number of different observations, or whenever we employ 
an equation which describes the relation between two or 
more variables. The value of statistical technique rests 
in large part upon its practical utility in the making of 
estimates. 

This object has been definitely to the fore in the preceding 
chapters, which dealt with methods by which the value 
of one variable might be estimated from a given value 
of another. We may, at this point, briefly summarize 
certain assumptions upon which the validity of this method 
rests. 

Some Assumptions Involved in the Making op 
Estimates 

In earlier chapters it has been pointed out that the most 
probable value of a series of observations is their arithmetic 
mean. Given a normal distribution about the mean, the 
standard deviation affords an exact measure of the proba- 
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bilities involved in basing estimates upon the mean. 
Similarly, the standard error of estimate affords an exact 
measure of the probabilities involved in basing estimates 
upon an equation of regression, again upon the assumption 
that the distribution about the line of regression is normal. 
The significance and usefulness of the equation of regression 
may be determined by comparing the standard error of 
estimate of a given variable with the standard deviation. 

From the relation between these two values, moreover, 
an abstract measure of relationship, the coefficient or index 
of correlation, may be computed. This coefficient, or index, 
is a thoroughly valid and accurate measure only if the 
distribution about the line of regression and the distribution 
about the mean are normal, or approximately so. Pro- 
nounced departures from the normal type lessen the signifi- 
cance of these measures. 

In the foregoing discussion we have been concerned with 
arithmetic values throughout. In speaking of estimates 
based upon the mean we referred to the arithmetic mean. 
The distributions about the mean and about the line of 
regression are assumed to be normal when deviations are 
measured arithmetically. The standard deviation and the 
standard error of estimate are in arithmetic terms, referring 
to absolute values. But may we assume that all the 
distributions we deal with in economic analysis are of the 
arithmetic type? Should estimates be made and errors 
of estimate measured only in arithmetic terms? If they 
should not be so limited, are the methods developed above 
capable of adaptation to other distributions? These ques- 
tions may best be answered in terms of a specific problem. 

A Problem of Estimation: Logarithmic and Ratio 

Values 

In Table 132 the production and price of oats in the 
United States from 1881 to 1913 are recorded. Appropriate 
lines of trend were fitted to these series and the ratios 
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Table 132 


Production and Price of Oats in the United States 


Year 

Production 
of oats in 
U . S. 
{millions 

Straight 
line trend 
of produc- 
tion 1 

Ratio of 
actual pro- 
duction to 
trend value 

Price of 
oats in 
Chicago 
{cents 

Straight 
line trend 
of price 2 

Ratio of 
actual 
price to 
trend value 


of bu.) 

per bu.) 


1881 

416 

448 

.929 

47 

36.0 

1.30 

1882 

488 

471 

1.036 

37 

35.3 

1.05 

1883 

571 

494 

1.156 

31 

34.6 

.90 

1884 

583 

517 

1.128 

29 

34.0 

.85 

1885 

629 

540 

1.165 

28 

33.2 

.84 

1886 

624 

563 

1.108 

25 

32.5 

.77 

1887 

659 

586 

1.124 

30 

31.2 

.96 

1888 

701 

609 

1.151 

24 

30.5 

.79 

1889 

751 

632 

1.188 

24 

29.8 

.81 

1890 

523 

655 

.798 

43 

29.0 

1.48 

1891 

738 

678 

1.088 

31 

28.3 

1.10 

1892 

661 

701 

.943 

30 

27.5 

1.09 

1893 

639 

724 

.882 

31 

26.8 

1.16 

1894 

662 

747 

.886 

28 

26.1 

1.07 

1895 

824 

770 

1.070 

19 

25.3 

.75 

1896 

780 

793 

.983 

18 

23.6 

.76 

1897 

791 

816 

.969 

24 

25.0 

.96 

1898 

843 

839 

1.005 

25 

26.4 

.95 

1899 

926 

862 

1.074 

23 

27.8 

.83 

1900 

914 

885 

1.033 

25 

29.2 

.86 

1901 

778 

908 

.857 

42 

30.6 

1 37 

1902 

1,053 

931 

1.131 

33 

32.0 

1.03 

1903 

869 

954 

.911 

38 

33.4 

1.14 

1904 

1,009 

977 

1.033 

30 

34.8 

.86 

1905 

1,090 

1,000 

1.090 

31 

36.2 

.86 

1906 

1,036 

1,023 

1.013 

39 

37.6 

1.04 

1907 

805 

1,046 

.770 

51 

39.0 

1.31 

1908 

851 

1,069 

.796 

52 

40.4 

1.29 

1909 

1,068 

1,092 

.978 

43 

41.8 

1.03 

1910 

1,186 

922 

1,115 

1064 

35 

43.2 

.81 

1911 

1,138 

.810 

51 

44.6 

1.14 

1912 

1,418 

1,161 

1.221 

37 

46.0 

.80 

1913 

1,122 

1,184 

.948 

41 

47.4 

.87 


1 This line of trend was fitted to data covering a longer period than that in- 
cluded in the present study. 

2 The entire period has been broken into two parts, 1881 to 1895 and 1896 
to 1913. A straight line of trend was fitted by H. 13. Killough to the data of 
each period. 
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of the actual values of the items in each series to the trend 
values determined. 

It is desired to measure the relation between these two 
variables. A hyperbolic curve of the general type Y = aX b 
appears to be an appropriate form to employ in describing 
such a relationship. To fit this curve by the method of 
least squares, the equation must be reduced to the loga- 
rithmic form 

log Y = log a + 6 log X. 

The normal equations required in fitting a curve of this 
type, are 

I 2 (log 7) = N log a + 6 2 (log X) 

II 2 (log X • log Y ) = log a2(log X) + 62 (log 2 X). 

The values necessary for the solution of these equations 
are determined from Table 133. 1 
From this table we have 

N = 33 

2 (log Y) = - .32849 2(logX - log Y) = - .1143005 

2 (log X) = .037535 2(log 2 X) = .096423. 

Substituting in the normal equations, we secure 

- .32849 = 33 log a + .0375356 
- .1143005 = .037535 log a + .0964236. 

Solving 

log a = — . 00861 
6 = - 1 . 18206. 

The required equation is 

log Y = (9.99139 - 10) - 1.18206 log X 


or 


Y= .9804X- 1 - 18206 . 

1 1 am indebted to Prof. H. B. Killough of Brown University for permission 
to use the data presented in Tables 132 and 133. The figures are taken from hi b 
comprehensive study of the factors affecting oat prices. 
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This is the equation which describes the average rela- 
tionship between the production and the price of oats 
(when the actual figures for each are expressed as ratios 
to the respective lines of trend). The corresponding curve 
is plotted in Fig. 88 on page 592. 


Table 133 


Computation of Values Required in Fitting a Curve to Data of Oat 
Production and Prices 
Example I 


(1) 

(2) 

(3) 

(4) 


(5) 


(0) 

(7) 


(8) 


Ratio 

Ratio 









Year 

of price 
to trend 

of pro- 
duction to 
trend 










Y 

X 

log Y 


log X 


log 1 Y 

log 1 X 

log Y- log X 

1881 

1.30 

929 

.1139434 


.9680157 - 

1 

.01298310 

.001022995 

— 

.0036444 

1882 

1.05 

1 030 

.0211803 


.0153598 


.00044899 

.000235923 


.0003255 

1883 

90 

1 156 

9542425 — 

1 

.0629578 


.00209375 

.003963085 

— 

.0028808 

1884 

85 

1 128 

.9294189 - 

1 

0523091 


.00498169 

.002736242 

— 

.0036920 

1885 

.84 

1 165 

.9242793 - 

1 

.0603259 


.00573362 

.004399125 

— 

0050222 

1880 

77 

1 108 

S8G4907 — 

1 

.0445398 


.01288430 

.001983794 

— 

0050557 

1887 

96 

1 124 

9822712 - 

1 

.0507663 


00031431 

.002677217 

— 

.0009000 

1888 

.79 

1 151 

.8976271 - 

1 

.0610753 


.01048021 

.003730192 

— 

0062524 

1889 

81 

1 188 

.9084850 - 

1 

0748164 


.00837500 

.005597494 

— 

.0068468 

1890 

1 48 

798 

1702617 


.9020029 - 

1 

.02898905 

.009603432 

— 

.0106852 

1891 

1 10 

1 088 

.0413927 


.0366289 


.00171336 

.001341676 


.0015102 

1892 

1 09 

.943 

.0374265 


9745117 - 

1 

.00140074 

.000649653 

— 

.0009539 

1893 

1 16 

882 

0644580 


.9454686 - 

1 

.00415483 

.002973674 

— 

.0035150 

1894 

1 07 

886 

.0293838 


.9474337 - 

1 

.00080341 

.002763216 

— 

.0015446 

1895 

75 

1 070 

8750C13 - 

1 

.0293838 


.01560968 

.000863408 

— 

0036712 

1898 

.76 

.933 

8808136 - 

1 

.9925535 - 

1 

.01420540 

.000055450 


0008875 

1897 

96 

969 

.9822712 - 

1 

.9863238 - 

1 

.00031431 

.000187038 


.0002425 

1898 

95 

1 005 

9777230 - 

1 

.0021661 


. 00049624 

.000004692 

— 

0000483 

1899 

83 

1 074 

.9190781 - 

1 

.0310043 


.00654835 

.000961267 

— 

0025089 

1900 

86 

1.033 

.9344985 - 

1 

.0141003 


.00429045 

.000198818 

— 

0009230 

1901 

1.37 

857 

1367206 


. 9329808 - 

1 

.01725310 

.004491573 

— 

.0091629 

1902 

1 03 

l 131 

.0128372 


.0534626 


.00016479 

.002858250 


0006863 

1903 

1 14 

911 

. 0569049 


.9595184 - 

1 

.00323817 

.001638760 

_ 

0023036 

1904 

.86 

1 033 

.9344985 - 

1 

.0141003 


.00429045 

.000198818 

— 

0009236 

1905 

86 

1 090 

.9344985 - 

1 

.0374265 


.00429045 

.001400743 

— 

0024515 

1906 

1 04 

1 013 

.0170333 


.0056094 


.00029013 

.000031465 


.0000955 

1907 

1 31 

770 

.1172713 


.8864907 - 

1 

.01375256 

.012884361 

— 

.0133113 

1908 

1 29 

.796 

.1105897 


.9009131 - 

1 

.01223008 

.009818214 

— 

.0109580 

1909 

1 03 

978 

.0128372 


. 9903389 - 

1 

.00016479 

.000093337 

— 

0001240 

1910 

.81 

1 064 

.9084850 - 

1 

.0269416 


.00837500 

.000725850 

— 

0024656 

1911 

1 14 

810 

0569049 


.9084850 - 

1 

00323817 

.008374812 

— 

0052076 

1912 

.80 

1.221 

. 9030900 - 

1 

.0867157 


.00939155 

.007519613 

— 

0084036 

J913 

87 

.948 

.9395193 - 

1 

.9768083 - 

1 

.00365792 

.000537855 


.0014027 

Total 

32 83 

33.338 

17.6715068 - 

18 

14.0375350 - 

14 

.21721807 

.096422642 


.1194567 


+ .0051662 


- .1143005 


THE STANDARD ERROR OF ESTIMATE IN LOGARITHMIC TERMS 

How reliable is this equation? With what degree of 
confidence may estimates be based upon it? To answer 
these questions we must compute the standard error, S. 
Since the fitting process was carried through in terms of 
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logarithms, the standard error may be computed in the 
same terms. Following the procedure explained in earlier 
sections with reference to the straight line and the potential 
series, we may derive the following equation relating to 
the logarithmic curve just fitted: 

oa _ 2 (log* Y) - log o2( log Y) - 62 (log X ■ log Y) 

° log v — * 

Substituting the proper values, we have 

-S^ogv 

= .21721807 - ( -.00861 X -.32849) - (- 1.18206 X - .1143005) 

33 

_ .07927928 
= 33 

-S 2 i og „ = .0024024 
S \o ty = .04901. 

The standard error of estimate, in the form of a loga- 
rithm, is .04901. As long as we deal with logarithms, 
this is to be interpreted precisely as is the standard error 
with respect to other curves. Assuming a normal distribution 
of logarithms about the curve which describes the average re- 
lationship, the chances are 68 out of 100 that the logarithm 
of a given estimate will not differ from the logarithm of the 
actual value by more than .04901, 95 out of 100 that the 
logarithm of the given estimate will not differ from the 
logarithm of the actual value by more than .09802, and 
99.7 out of 100 that the logarithm of the given estimate 
will not differ from the logarithm of the actual value by 
more than . 14703. 

INTERPRETATION OP THE STANDARD ERROR OP ESTIMATE,* 
ZONES OP ESTIMATE 

What does this mean in terms of actual values? It means, 
simply, that we are dealing throughout in terms of ratios 
instead of absolute figures. The difference between the 
logarithms of two numbers is the logarithm of the ratio 
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of one of the original numbers to the other. Thus the 
absolute value of 5 in a given case will depend upon the 
magnitude of the values with which we are dealing. If 
the user desires to reduce 5 to absolute values, it must be 
done always with reference to a given estimate. That is, 
a given value of X is substituted in the equation of average 
relationship and the corresponding value of Y estimated. 
If the logarithmic equation is used, this estimate will be 
in the form of a logarithm. To the logarithm of the estimate 
add the value of S\ ogy . The anti-logarithm of the number 
thus secured will give the upper limit of a zone extending 
a distance equal to 5 above the line of regression. From 
the logarithm of the estimate subtract the value of S\ oeu . 
The anti-logarithm of the number thus secured will give 
the lower limit of a zone extending a distance equal to 
5 below the line of regression. The odds are 68 out of 100 
that the value of Y in the given case will fall within the 
limits thus marked out. The absolute limits corresponding 
to 25 and 35 may be similarly determined. 

The zone thus marked out with respect to a logarithmic 
curve will differ materially from the similar zones already 
described in dealing with simple linear equations. In the 
simple case a zone extending 1<S on each side of the estimating 
curve has the same absolute width throughout its length, 
and is centered always at the line of regression. The loga- 
rithmic zone, when measured in natural numbers, is of 
varying width, and, moreover, is not of the same width 
on each side of the plotted curve. It is true, however, that 
the ratios on the two sides of the curve are always equal. 
That is, the ratio of a value 15 less than the computed 
value to the computed value is the same as the ratio of 
the latter to a value 15 greater. And when the curves 
are plotted on paper ruled logarithmically, the zone included 
within a distance 15 on each side of the plotted curve 
takes the symmetrical form found in the earlier and simpler 
cases. A person accustomed to thinking in terms of ratios 
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and to the use of logarithmic paper can readily interpret 
this measure. 

THE STANDARD ERROR OF ESTIMATE IN TERMS OF RATIOS 

Since the ratios are equal throughout, the standard error 
of estimate may be expressed in ratio terms. In the present 
example we have 

S T = anti-log &iog v = anti-log . 04901 = 1.12 

where S t is used to represent the standard error of estimate 
in terms of ratios. <Siog„, as derived above, is positive, 
hence the ratio exceeds unity. It is the ratio of the larger 
number to the smaller. What does it mean? It means 
that in 68 cases out of 100 the actual value, if it exceed 
the estimate, will not exceed it by more than 12 per cent, 
and, if it fall below the estimate, will stay within a 
limit such that the estimate will not be more than 12 per 
cent greater than the actual value. This is not a conven- 
ient form, since this ratio always expresses the larger value 
in terms of the smaller value. It would be more conven- 
ient to have it always in terms of a percentage of the esti- 
mate. This may be done by putting Si ogv in negative terms, 
and getting the corresponding natural value. The value 
— .04901 = 9.95099 — 10, which is the logarithm of .8933. 
In this form the ratio is based upon the relation of the 
smaller to the larger number. To make S t readily intelligible 
we may combine the two, writing 

S t = .89 to 1.12. 

Interpreting this, it means that, given a normal distribution, 
in 68 cases out of 100 the actual value will not be less than 
89 per cent of the estimate, or more than 112 per cent of 
the estimate. This has a simple, definite meaning more 
significant for most practical purposes than a similar 
measure in terms of absolute values. 1 

1 The significance of a measure of reliability in percentage form was pointed 
out by D. H. Davenport in 1922, in an unpublished article, and such a measure 
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To find the values of 25 or 35 these percentage figures 
may not be simply multiplied by 2 or 3. The value of 
Si 0 g V must be so multiplied, and the resulting values reduced 
to natural numbers. For convenience in use, the anti- 
logarithms of both the positive and negative values should 
be secured, as in the preceding case. The computations 
are simple. 

25i og v = .09802. 

The anti-logarithm of this value, when considered positive 
is 1.25, when negative, . 80. 

35iog y = .14703. 

The corresponding anti-logarithms are 1.40 and .71. Sum- 
marizing for the standard error, we have 

S r = .89 to 1.12 
2 S r = .80 to 1.25 
35 r = .71 to 1.40. 

The values given for 5 r indicate the probable percentage 
limits within which actual value and estimated value should 
fall in 68 out of 100 cases. The values given for 2 S r indicate 
the probable percentage limits in 95 out of 100 cases. 
The values of 3 5 r indicate the probable percentage limits 
in 99.7 cases out of 100, always on the assumption of a 
normal distribution of the logarithms of the actual values 
about the fitted curve. 

APPLICATION OF THE STANDARD ERROR OF ESTIMATE 

We may illustrate the use of 5i 0gJ ,. Given a production 
of oats 50 per cent above the trend value (i.e., the ratio to 
trend is 1.50), what is the most probable accompanying 
price ratio and what is the degree of accuracy of this 
estimate? 


has been employed in several studies. There has not been available, however, 
a ready method of computing this measure, and its possibilities have not, there- 
fore, been fully realized* 
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The estimating equation is 

log Y = (9.99139 - 10) - 1.18206 log X. 

Substituting in this equation the value .176091 (the loga- 
rithm of 1 . 50) we secure for log Y the value 9 . 78324 — 10. 
The corresponding natural number is .607. This means 
that if production is 150 per cent of normal (as measured 
by the given line of trend) price will probably be 60.7 
per cent of normal (as measured by the line of trend). 

To determine the reliability of this estimate, the standard 
error must be secured. Employing the values of S r already 
computed we find that 54 is 89 per cent of 60.7, while 
68 is 112 per cent of 60.7. We interpret these figures to 
mean that in 68 cases out of 100 the actual price prevailing 
under the given production conditions will not be less than 
54 per cent of the normal or trend value nor more than 
68 per cent of normal. 1 Corresponding values for 2S r 
and 3<S r may be determined in the manner outlined above. 

THE INDEX OP CORRELATION BASED ON LOGARITHMIC VALUES 

We have still to compute the third measure, the abstract 
index of correlation. 2 For an equation of the type 

log Y = log a + 6 log X 

the formula for p reduces to 

, _ log a2(log Y) + 62 (log X • log Y) - Nc\ ot y 

P log y log * 2(log 2 Y) - Nc* log y 

where ci og „ represents the difference between the arithmetic 
mean of the logarithms of the F-values and the origin 
(in this case, zero on the logarithmic scale). Substituting 

1 A question arises at once as to the adequacy of the given lines of trend, 
in the present problem. This question is discussed in greater detail in another 
section. 

2 The symbol p is used for this measure of correlation, instead of r, even 
though the relationship in logarithmic form is linear. This is done because such 
a measure, in terms of logarithms, cannot be interpreted in precisely the same 
way as the ordinary coefficient of correlation. 



576 THE PROBLEM OF ESTIMATION 


the proper values, we have 

P*log v log * 

(-.00861 X - .32849) + (- 1.18206 X - .1143005) - (33 X .000099091 
“ .21721807 - (33 X .00009909) 

.13466882 

.2139481 

= .629445 
Piog v log * = ■ 793. 

The index of correlation has a value of .793. How is 
this to be interpreted when we are dealing with logarithms 
as in the present case? 

Its significance may be clearer if viewed in terms of the 
relationship 

2 , *S^Iog y 

P log V log X — 1 ' » 

log V 

In the present case these values are 

Siog, = 04901 
ff iog v = • 08052. 

When these values are squared and inserted in the above 
formula, we have 

, , .002402 

P log, log* - 1 006483 


and 


Piog v i ok x — • 793. 

What does this value measure? We have seen that r 
and the more general index p are abstract measures of the 
degree of relationship between two variables, as this rela- 
tionship is described by given functions. The value of p 
in a given case depends upon the variability about the 
fitted line, in relation to the variability about the mean 
of the F’s. If the variability of estimates is materially 
reduced when the equation of regression is used as a basis 
for estimates, instead of the mean Y, the equation may be 
assumed to describe a significant relationship. The value 
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of p depends thus upon the relation between the two 
quantities, S y and cr„. 

In the cases dealt with in the preceding chapter the 
variability in each case was measured in terms of absolute 
deviations, and the value of p depended upon the relation 
between the two given measures of absolute variability. 
The sole difference in the present case is that we are working 
in terms of logarithmic or ratio variability, deviations being 
measured in terms of logarithms instead of natural numbers. 

The index p must be interpreted in the light of this fact. 
Its value, as always, depends upon the relation between 
two measures of variability, S 2 and <r 2 , but in the present 
instance these are expressed in terms of logarithms. In 
brief, the value of p depends upon the relation between 
the ratio variability about the fitted curve and the ratio 
variability about the geometric mean of the F’s. (It is 
the geometric mean of the F’s, because that is the value 
corresponding to the arithmetic mean of the Y logarithms.) 

We have here a set of measures, therefore, which perform 
in the field of ratios precisely the same service as is per- 
formed in the field of natural numbers by S and p (in the 
linear case, r). These measures are secured in the same 
way as are S and p, except that the equation of relationship 
from which they are derived is one in which the dependent 
variable is log Y (or, in the reverse case, log X ). The general 
formulas for computing these values are the same as in 
dealing with natural numbers, except that log Y replaces 
Y throughout. The operation is analogous to that of using 
logarithmic paper instead of natural scale paper. 

It should be noted that the values are in logarithmic or 
ratio form if Y is expressed logarithmically, whether X 
be so expressed or not. Thus we have fitted a curve of 
the type 

log Y — log a + b log X 

the logarithmic form of the ordinary parabola or hyperbola. 
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The values S and p would also be in logarithmic form if 
the curve were of the type 

log F = log a + X log b 
the logarithmic form of the exponential 

Y = a(b x ). 

In each of these cases the logarithmic equation is linear, 
but this is not essential to the use of these measures. S and 
p are generally applicable measures, whether ratios or nat- 
ural numbers be dealt with, and whether the functions be 
linear or otherwise. 

It may be well at this point to summarize the symbols 
that have been used and to distinguish the different meas- 
ures. We may employ the symbols S v , <r„, and p when 
arithmetic relations are in question, the two former being 
measures of variation in absolute terms, and the index p 
referring to degree of relationship when natural numbers 
are employed. If the logarithms of the F’s are used it is 
advisable to distinguish the symbols by subscripts, using 
Siogy and aiogy as measures of the logarithmic variation 
about the fitted curve and about the arithmetic mean 
of the logarithms of the F’s, respectively. If Si oe v is reduced 
to ratio form, it may be written S T . Since the index p 
must be interpreted somewhat differently in this case, it 
may be written piog y tog or piog y*. 


The Use op Reciprocals in the Measurement op 

Relationship 

Another type of curve may be used to describe the 
relationship between the production and price of oats, and 
its use introduces us to a third field of correlation, a field 
in which somewhat new concepts enter, and in which the 
various measures must be interpreted in still another way. 
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This is a curve of the type 

1 ~ a + bX 

which may be expanded by adding additional terms to 
the denominator, as 


Y = 

a + bX + cX 2 

This hyperbolic form has been used in several studies as 
an approximation to a “demand” curve for various com- 
modities. 

The equation to a curve of this type may be written 
Y = a + bX 

which is the equation to a straight line describing the rela- 
tionship between the reciprocals of the F’s and the original 
X values. The normal equations required in fitting a 
curve of this type are 

I = Na + 62(X) 

II = aS(X) + bZ(X)\ 

The method of computing the necessary values is illustrated 
in Table 134. 

Substituting the proper values in the normal equations, 
we have 

34.3360320 = 33a + 33.3386 
35.2571485 = 33.338a + 34.1685546. 

Solving, 

a = - . 1357 
6 = 1.1643. 
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Table 134 

Computation of Values Required in Fitting a Curve to Data of Oat 
Production and Prices 


Example II 


(1) 

(2) 

(3) 

Produc- 

(4) 

(5) 

(6) 

(7) 

Year 

Price 

tion 






Ratio 

Ratio 

1 

X 

/iy 



Y 

X 

Y 

Y 

\Y) 

X 2 

1881 

1.30 

.929 

.7692308 

.7146154 

.59171602 

.863041 

1882 

1.05 

1.036 

.9523810 

.9866667 

.90702957 

1.073296 

1883 

.90 

1.156 

1.1111111 

1.2844444 

1.23456788 

1.336336 

1884 

.85 

1.128 

1.1764706 

1.3270588 

1.38408307 

1.272384 

1885 

.84 

1.165 

1 . 1904762 

1.3869048 

1.41723358 

1.357225 

1886 

.77 

1.108 

1.2987013 

1.4389610 

1.68662507 

1.227664 

1887 

.96 

1.124 

1.0416667 

1.1708334 

1.08506951 

1.263376 

1888 

.79 

1.151 

1.2658228 

1.4569620 

1.60230736 

1.324801 

1889 

.81 

1.188 

1.2345679 

1.4666667 

1.52415790 

1.411344 

1890 

1.48 

.798 

.6756757 

.5391892 

.45653765 

.636804 

1891 

1.10 

1.088 

.9090909 

.9890909 

.82644626 

1.183744 

1892 

1.09 

.943 

.9174312 

.8651376 

.84168001 

.889249 

1893 

1.16 

.882 

.8620690 

.7603449 

.74316296 

.777924 

1894 

1.07 

.886 

.9345794 

.8280373 

.87343865 

.784996 

1895 

.75 

1.070 

1.3333333 

1.4266666 

1.77777769 

1 . 144900 

1896 

.76 

.983 

1.3157895 

1.2934211 

1.73130201 

.966289 

1897 

.96 

.969 

1.0416667 

1.0093750 

1.08506951 

.938961 

1898 

.95 

1.005 

1.0526316 

1.0578948 

1.10803329 

1.010025 

1899 

.83 

1.074 

1.2048193 

1.2939759 

1.45158955 

1.153476 

1900 

.86 

1.033 

1.162790 7 

1.2011628 

1.35208221 

1.067089 

1901 

1.37 

.857 

.7299270 

.6255480 

.53279343 

.734449 

1902 

1.03 

1.131 

.9708738 

1.0980583 

.94259594 

1.279161 

1903 

1.14 

.911 

.8771930 

.7991228 

.76946756 

.829921 

1904 

.86 

1.033 

1.1627907 

1.2011628 

1.35208221 

1.067089 

1905 

.86 

1.090 

1.1627907 

1.2674419 

1.35208221 

1.188100 

1906 

1.04 

1.013 

.9615385 

.9740385 

.92455629 

1.026169 

1907 

1.31 

.770 

.7633588 

.5877863 

.58271666 

.592900 

1908 

1.29 

.796 

.7751938 

.6170543 

.60092543 

.633616 

1909 

1.03 

.978 

.9708738 

.9495146 

.94259594 

.956484 

1910 

.81 

1.064 

1.2345679 

1.3135802 

1.52415790 

1.132096 

1911 

1.14 

.810 

.8771930 

.7105263 

.76946756 

.656100 

1912 

.80 

1.221 

1.2500000 

1.5262500 

1.56250000 

1.490841 

1913 

.87 

.948 

1.1494253 

1.0896552 

1.32117852 

.898704 

Total 

32.83 

33.338 

34.3360320 

35.2571485 

36.85702940 

34.168554 


The desired equation is, therefore, 

y = - . 1357 + 1 • 1643X. 



HARMONIC MEASUREMENTS 581 


THE STANDARD ERROR AND THE INDEX OF CORRELATION IN 
TERMS OF RECIPROCALS 

To determine the utility of this equation we must have 
the standard error and the index of correlation. The two 
necessary formulas may be derived as in the preceding 

cases. Representing by p the reciprocal of an actual value 

we have, for each residual, 

d = a + bX - y- (1) 

Multiplying by d and summing 

2(d J ) = aS(d) + VL(dx) - 2 (p) • 

Since 


we have 


2 (d) = OandS (dX) = 0, 


2 (d 2 ) = - 21 


Multiplying the residual equation (1) now by p> and sum- 
ming, we have 

Kr) “ «K?) + B (f ) - 2 (r) ! ' 

Substituting the equivalent of 2 0) in the preceding equa- 
tion (2), we secure 

s (d 2 ) = 20 )’ - o20) - 620) 


and for Si 2 , we have 


Kf)' - ° z (p) - Kj) 


v 
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Inserting this value of Si 2 in the general formula for the 

V 

index of correlation 



V 


and simplifying, we have 



Inserting the proper values in these two equations, we find 
that 

Si = .1191 
Pi x = .766. 

V 

For the standard deviation of the original F-values, in 
terms of reciprocals, we secure 

<T X = . 1851. 

V 

(The subscript ^ is used in connection with each of these 

measures, as they should be distinguished from measures 
based upon natural numbers or logarithms.) 

INTERPRETATION OP THE STANDARD ERROR OF ESTIMATE 

How may we interpret these results? As in all former 
problems of t,his type the equation gives us a means of 
estimating Y from a known value of X. The standard 
error Si serves as a measure of the reliability of such 

V 

estimates, and pi z is an abstract measure of the degree of 

V 

relationship between the two variables. But in the present 
case all these measures are in terms of reciprocals. The 
equation enables us to estimate the reciprocal of Y, the 
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standard error has significance only in the form of a recipro- 
cal, and the value of p depends upon the relation between 
two measures (Si 2 and tri 2 ) both of which are in terms of 

reciprocals. 

An illustration may make these meanings clear. If, in 
a given year, the production of oats is 150 per cent of 
trend, what is the most probable price? Substituting in 
the equation 


- .1357 + 1.1643X 


a value of 1.50 for X, we have 


and 


y. - 1.6108 


Y = .621. 


We may expect a price approximately 62 per cent of trend. 
As a measure of the reliability of this estimate, we have 

Si = .1191. 

if 

This must be applied to the estimate in terms of reciprocals. 
Thus we have 

1.6108 + .1191 = 1.7299 
1.6108 - .1191 = 1.4917. 

Reducing these reciprocals to natural numbers we secure 
.578 and .670 as the desired values. The most probable 
price, then, is 62.1 per cent of trend, and, on the assump- 
tion of an approximately normal distribution of reciprocals 
about the curve, the odds are 68 out of 100 that the price 
will fall between 57.8 per cent of trend and 67.0 per cent 
of trend. The limits of 2 S and 3 S may be similarly deter- 
mined by adding to and subtracting from the estimate, 
as a reciprocal, amounts equal to twice .1191 and three 
times .1191. The results secured may then be converted 
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to natural numbers. Just as with logarithms, the value 
in absolute terms of a given difference between reciprocals 
varies at different points within the range of F-values. 
Accordingly, the limits of reliability determined from Si 

V 

should be expressed in natural numbers only after a particu- 
lar estimate has been made. 

A Comparison of Measures of Relationship 

In interpreting p similar considerations enter. The value 
of the index of correlation, as we have seen, depends upon 
the degree of variation about the curve, as compared with 
the variation about the average of the original dependent 
series. In handling natural numbers, variability about the 
fitted line is compared with the variability about the 
arithmetic mean of the dependent variable, both measured 
in absolute terms (i.e., S y is compared with <r„). In handling 
logarithms, variability about the fitted line is compared 
with variability about the arithmetic mean of the loga- 
rithms of the dependent series, variability being measured 
in each case in terms of logarithms. But logarithmic 
deviations, as we have seen, may be interpreted in terms 
of ratios. The logarithmic deviations from the line represent 
the ratios of actual values to computed, while logarithmic 
deviations about the arithmetic mean of the logarithms of 
the original series represent the ratios of the actual values 
of the dependent series to their geometric mean. The value 
of Piog v depends upon the relation between these respective 
deviations (i.e., S\ ogy is compared with <T\ oey ). 

In fitting a curve in which the reciprocals of the dependent 
variable are employed, variability about the fitted line is 
measured in terms of reciprocals, and the variability of 
the original series is measured in the same terms. That 
is, <x i is computed from the differences between the recipro- 

V 

cals of the actual values and the arithmetic mean of all 
these reciprocals. But the arithmetic mean of these recipro- 
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cals is the reciprocal of the harmonic mean. Thus, in. 
short, the value of the index of correlation, pi, depends upon 

V 

the relation between variability about the fitted line and 
variability about the harmonic mean of the dependent series, 
variation in both cases being measured in terms of reciprocals 
(i. e., Si is compared with cri). 

V V 

We have, therefore, three broad families of curves for 
describing the relationship between variable quantities. 
These are: 

1. Curves in the fitting of which natural values of the dependent 

variable are employed. Equations to all curves of this family 
will be of the type 

T =/(*). 

2. Curves in the fitting of which logarithms of the dependent 

variable are employed. In all such cases the equations will be 
of the type 

log y = /(*). 

3. Curves in the fitting of which reciprocals of the dependent 

variable are employed. For these curves the equations will 
be of the type 

\ = /(*)• 

In any one of these three cases the equations may be 
linear or non-linear. In so far as this problem of interpreta- 
tion is concerned, there is no limitation as to the function 
of X which may be employed. (The computation of S 
and p by the methods suggested above involves certain 
limitations, which are outlined elsewhere.) 

The standard error of estimate for the first family of 
curves is derived in terms of the original units of measure- 
ment (for the dependent variable) and has a direct and 
simple meaning in these terms. The index of correlation, 
for curves of this type, is a measure of the degree to which 
the absolute variability of the dependent variable may be 
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lessened by measuring deviations from the fitted curve 
instead of from the arithmetic mean. 

The standard error of estimate for the second family of 
curves is derived, by the method outlined, in terms of 
logarithms. It is more convenient in general to give it mean- 
ing in terms of ratios. The index of correlation, pi og v log *, 
is a measure of the degree to which the logarithmic or 
ratio variability of the dependent variable may be lessened 
by computing deviations (or ratios) with the fitted curve 
instead of the geometric mean as base. 

The standard error of estimate for the third family of 
curves is derived by the same process as in the other cases, 
but emerges as a reciprocal. The index of correlation, pi , 

if* 

is a measure of the degree to which the variability of the 
dependent variable, in terms of reciprocals, may be lessened 
by computing reciprocal deviations from the fitted curve 
instead of from the harmonic mean. 

FACTORS GOVERNING THE CHOICE OF MEASURES OF 
RELATIONSHIP 

It is clear, therefore, that the choice of a type of curve 
to describe a given relationship must be governed by basic 
considerations as to the type of average which is most 
appropriate as a measure of the central tendency of the 
given series. And this brings in a related question as to 
whether the dispersion about this average more nearly 
approximates the normal type when measured in absolute 
terms, in logarithms, or in reciprocals. In selecting a 
curve and in using the measures S and p there is always 
present an implicit assumption with respect to these points. 

When absolute values are important, and the dispersion 
of the dependent variable approaches the normal type when 
plotted on an arithmetic scale, measures of relationship of 
the arithmetic type would appear to be appropriate. But, 
as we have seen, in handling series in which rates of change 
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rather than absolute amounts of change are of primary 
importance and the dispersion appears to follow a geometric 
law, the arithmetic mean and other arithmetic measures 
are notoriously inadequate. In such cases logarithmic curves 
seem preferable to arithmetic, and measures of the reliability 
of estimates and of degree of relationship which are based 
upon ratios seem to be more suitable than those based upon 
absolute values. 

The harmonic mean has not been so widely employed as 
either of the above averages, and some attention may be 
given to principles governing its use in problems of the 
type here considered. In general, such harmonic measures 
are marked by the same weaknesses as the arithmetic, 
except that they err in the opposite direction. Geometric 
measures are perhaps better adapted to all-around employ- 
ment than either. Yet in one particular field of interest 
to the economist the harmonic mean is particularly appro- 
priate, and the utilization of reciprocals, as in the preceding 
example, seems to be justified. 

The use of the harmonic mean assumes a normal distribu- 
tion of reciprocals which, in natural numbers, means a 
much wider scatter above the average than below. The 
use of a curve of the type 

y - a + bX 

involves a similar assumption as to the relation between 
Y and X. A given absolute increase in X will be accom- 
panied by a certain decrease in the value of Y. The same 
absolute decrease in X will be accompanied by an increase 
in the value of Y which is larger than the decrease registered 
in the preceding case. But this is the relation which prevails, 
for many commodities, between the amounts produced and 
the price, the latter considered dependent. A given increase 
in production will cause some lowering of price. An equal 
decrease will cause a much greater increase in price. 
Moreover, when averaging the prices of such commodities 
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over a period, the harmonic mean may give a more typical 
value than any other average. 1 In such cases there is a 
strong a priori justification for using a curve of the reciprocal 
type and measuring the accuracy of all estimates in terms 
of harmonic relations. 


ARITHMETIC, GEOMETRIC, AND HARMONIC MEASURES 

The contrast between these different methods may be 
brought home most effectively by comparing the results 
obtained when curves of these three types are fitted to the 
same data. The computations involved in fitting curves 
of the second and third types (logarithmic and reciprocal) 
have been illustrated with reference to the data of oat 
production and prices (Table 132). A straight line (arith- 

1 “ Buyers and sellers of potatoes are frequently mistaken as to the price 
justified by fundamental economic conditions. If such an error is general in the 
fall, it may happen, for example, that the price which results is too high. If the 
price is too high in the early part of the season, potatoes will not be consumed 
fast enough to dispose of the supply available. Farmers and dealers will then 
find that not all of the stocks on hand can be sold at existing prices. Since 
potatoes can not be carried over from one year to the next, the price, under 
such conditions as have been mentioned, must be lowered enough to permit 
the supply to be disposed of before the end of the season. A properly adjusted 
price would remain the same throughout the season, except for a gradual ad- 
vance to cover cost of storage, and would maintain a fairly uniform consump- 
tion throughout the season. But since an abnormally high price early in the 
season causes small consumption, it must be compensated by an abnormally 
low price during the remainder of the season, or not all the crop can be sold. 

“ Similarly, if the price is abnormally low early in the season, the supply will 
be exhausted too rapidly and those who still have potatoes will find that they 
can get abnormally high prices for them during the remainder of the season.” 

But how, given the abnormally high or abnormally low prices during part of 
a season, may we compute the average price which would be justified by the 
true conditions of demand and supply, if these had been correctly estimated? 
Since “a low price during part of a season will be compensated only by a dis- 
proportionately high price during the remainder of the season” the arithmetic 
average for an entire season “ will be somewhat higher than the average which 
would have resulted had a proper price been established at the beginning of the 
season. This difficulty is eliminated hy taking the harmonic mean of the monthly 
prices .” 

Holbrook Working, Factors Determining the Price of Potatoes in St. Paul and 
Minneapolis. Technical Bulletin 10, University of Minnesota Agricultural 
Experiment Station, 8-10. 
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metic) is fitted to the same data, and the necessary accom- 
panying measures computed. The three sets of results are 
brought together in Table 135. 


Table 135 


Relation between the Production and Price of Oats, 1881-1913 
Comparison of Results of Curve Fitting 
(Prices are the dependent variable in each case) 

Standard error Index of 
of estimate correlation 


Equation 


A Y = 2.24 - 1.236X 
B = - .1357 + 1.1643X 


S„ = .12 


r vx = . 783 




.766 


y . AUl/ < | X . X U Si = .1191 

•* v 

C Log F = - .00861 - 1.18206 log X S hgv = .04901 p logvlogx = .793 


It is impossible to compare the three standard errors as 
they stand, since only the first one is in the original units 
of measurement (ratio of actual price to trend). In the 
following table are given estimates, based on each of these 
equations, as to the most probable price (in terms of ratio 
to trend) which would accompany each of five different 
conditions of production. 1 Each estimate is accompanied 
by a series of values which indicate the limits set by the 
standard error. Throughout, the values of the estimates 
plus and minus S, 2 S, and 3 S are given, in order to indicate 
the probable scatter of actual values about the estimates. 
The different amounts of variation which may be expected 
about each of the three lines of relationship are measured 
by the actual differences between the estimates and the 
limiting cases. These differences are given in the columns 
headed A. All values in this table are comparable, being 
reduced to the original units (ratio of actual price to trend). 

1 For the purpose of this illustration the limits of actual observation have 
been exceeded in setting up Table 136. Such extrapolation involves the pos- 
sibility of errors of another sort. With these we are not here concerned. 
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Table 136 


Comparison of Price Estimates and of Standard Errors of Estimate 
Based on Three Equations Relating to the Production and 
Price of Oats 


(1) 

Value 
of X 
(ratio 
of 
pro- 
duc- 
tion to 
nor- 
mal) 

(2) 

Estimated 
value of Y 
(ratio of 
price to 
trend) 
from 

arithmetic 

equation 

(A) 

(3) 

Limits of 
arithmetic 
estimate 

(4) 

A 

(5) 

Estimated 
value of V 
from 
reciprocal 
equation 
(B) 

(6) 

Limits of 
estimate, 
reciprocal 

(7) 

A 

(8) 

Estimated 
value of Y 
from 
logarith- 
mic 

equation 

(O 

(9) 

Limits of 
logarithmic 
estimate 

(10) 

A 



+33 = 1.982 

+ .36 


+33 = 11.223 

+8.983 


+38=3.114 

+ .890 



+28 = 1.862 

+ .24 


+28 = 4.803 

+2.563 


+23=2.780 

+ .566 

.5 

1.622 

+8 = 1.742 

+ .12 

2.240 

+8= 3.055 

+ 815 

2.224 

+8=2.491 

+ .267 



-8 = 1.502 

- .12 


-8= 1.768 

- .472 


-3 = 1.979 

- 245 



-28 = 1.382 

- .24 


-28= 1.461 

- .779 


-23 = 1.779 

- .445 



-38 = 1.262 

- .36 


-38= 1.244 

- .996 


-33 = 1 679 

-.645 



+38 = 1 611 

+ .36 


+38=2.281 

+ 1.024 


+38 = 1.786 

+ 510 



+23 = 1 491 

+ .24 


+28 = 1.794 

+ .537 


+28 = 1.595 

+ .319 

.8 

1 251 

+8 = 1.371 

+ .12 

1.257 

+8 = 1.478 

+ .221 

1 276 

+8 = 1 429 

+ .153 



-8 = 1.131 

- .12 


-8 = 1.093 

- .164 


-8 = 1 136 

-.140 



-28 = 1.011 

- .24 


-28= .967 

-.290 


-28 = 1 021 

-.265 



-38= .891 

- .36 


-38= .867 

-.390 


-38= .906 

— .370 



+38 = 1 364 

+ .36 


+38 = 1.490 

+ .518 


+38 = 1.372 

+ .392 



+28 = 1.244 

+ 24 


+28 = 1.265 

+ .293 


+28 = 1.226 

+ .245 

1.0 

1 004 

+8 = 1.124 

+ .12 

.972 

+8 = 1.100 

+ .128 

.980 

+8 = 1.098 

+ .118 



-8= .884 

- .12 


-8= .871 

- .101 


-8= .872 

- .108 



-28= .764 

- .24 


-28= .789 

- .183 


-28= .784 

-.196 



-38= .644 

- 36 


-38= .722 

- 250 


-38= .696 

-.284 



+33 = 1.117 

+ .36 


+38 = 1.106 

+ .313 


+38 = 1.106 

+ .316 



+28= .997 

+ .24 


+28= .977 

+ .184 


+28= .987 

+ .197 

1.2 

.757 

+8= .877 

+ .12 

.793 

+8= .875 

+ .082 

.790 

+8= .886 

+ .095 



-8= .637 

- .12 


-8= .724 

- 069 


-8= .703 

-.087 



-28= .517 

- .24 


-28= .667 

- .126 


-28= .632 

-.158 



-33= .397 

- 36 


-38= .618 

-.175 


-38= .561 

-.229 



+38 = .746 

+ 36 


+38 =.798 

+ .177 


+38= 852 

+ .245 



+28 = .626 

+ .24 


+28-.728 

+ .107 


+28 = 761 

+ 164 

1.5 

.386 

+8 =.506 

+ .12 

.621 

+8 =.670 

+ .049 

.607 

+8-.680 

+ .073 



-8 = .266 

- .12 


-8= .578 

- 043 


— 8 = . 542 

-.065 



-28= 146 

- .24 


-28= 541 

- 080 


-28 = .484 

-.123 



-33= .026 

- 36 


-38= 508 

- .113 


-38= 433 

-.174 


Zones of Estimate and Their Significance 

A careful study of this table should make clear the nature 
of estimates based on the three types of equations here 
presented. The fundamental differences lie not so much 
in the actual values of the estimates, as in the standard 
errors which measure the reliability of these estimates and 
indicate the limits within which the actual values are likely 
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Fig. 87. — The Relation between the Production and Price of Oats: 
Illustrating the Use of an Arithmetic Equation of Regression and Arith- 
metic Zones of Estimate 


to fall. In other words, the differences lie in the assumptions 
made as to the character of the scatter about the curves. 

The measure S v , which relates to the arithmetic curve, 
gives the same absolute range to errors of estimate whether 
the estimated value be high or low. An arithmetic dispersion 
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Fig. 88. — The Relation between the Production and Price of Oats: 
Illustrating the Use of a Logarithmic Equation of Regression and Geo- 
metric Zones of Estimate 


about the curve is assumed. In each case the estimate 
is the arithmetic mean of the value which exceeds the 
estimate by an amount equal to S v (or any multiple of S„) 
and the value which falls below it by an equal amount. 
These conditions are brought out graphically in Fig. 87. 




ZONES OF ESTIMATE 


593 


The original points are plotted, the straight line of relation- 
ship (arithmetic) is shown, and zones of estimate having 



Production Ratio 


Fig. 89. — The Relation between the Production and Price of Oats: 
Illustrating the Use of a Logarithmic Equation of Regression and Geo- 
metric Zones of Estimate (Plotted on Double Logarithmic Paper) 

widths, respectively, of 2 S, 4 S, and 6 S, centering at the 
fitted line, are marked out. 

The measure <S log „ gives the same relative or percentage 
range to errors of estimate, whether the estimate be high 
or low. This means that the absolute range within which 
the actual values should fall is much less when the estimates 
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are low than when they are high. It assumes a geometric 
dispersion about the curve which describes the relationship. 
The estimate is, in this case, the geometric mean of the 
value which exceeds it by an amount equal to Si oey (or 
any multiple of Si og v ) and the value which falls below it 
by an equal amount. Fig. 88 presents these relationships 
graphically. The original data are here plotted, together 
with the graph of the equation 

Y = . 9804 A -1 18208 . 

There are shown, also, the limits of zones of estimate having 
widths equal, respectively, to 2Sr, 4 Sr, and 6 Sr, centering 
(geometrically) at the line of relationship. A comparison 
of Fig. 87 and Fig. 88 will reveal the differences between 
estimates based on the assumption of an arithmetic distribu- 
tion and those based on the assumption of a geometric 
distribution. 

The points and lines shown in Fig. 88 are plotted on a 
logarithmic scale in Fig. 89. On this scale the curve of 
relationship becomes straight, and the zones of estimate 
appear as symmetrical and of equal width throughout the 
range. This transformation when the data are plotted on 
logarithmic paper makes clear the fundamental simplicity 
of the assumptions involved in making estimates from 
logarithmic values. 

In using the measure <Si we carry still further the assump- 

V 

tion that the variability about the curve is greater with 
high prices than with low. It shows a very limited range 
to errors of estimate when the estimate is low and a very 
wide range when the estimated price is high. A harmonic 
dispersion about the curve is assumed. The computed 
value, or estimate, is always the harmonic mean of the 
value which exceeds it by an amount equal to Si (or any 

V 

multiple of Si) and the value which falls below it by an 

V 

equal amount. 
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Fig. 90. — The Relation between the Production and Price of Oats: 
Illustrating the Use of an Equation of Regression Based upon Reciprocals, 
and of Harmonic Zones of Estimate 


In Fig. 90 the curve p = — . 1357 + 1 . 1643X is plotted, 

together with the original observations. Zones of estimate 
with widths of 25i, 45i, and 65 1 , centering (harmonically) 

V V V 

at the fitted line, are shown. The differences between this 
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figure and each of the two preceding are quite marked, 
particularly with respect to the zones of estimate. On the 
assumption of a normal harmonic distribution about the 
curve describing the relationship, the outer zone (with width 
equal to 6/S) marks the limits within which 99.7 per cent 
of all the points should fall, and the inner zone (with width 
equal to 2/S) marks the limits within which 68 per cent 
of all the points should fall. By plotting reciprocals through- 
out, instead of natural numbers, this apparently abnormal 
distribution could be reduced to the symmetrical form 
seemed in plotting the geometric values on the logarithmic 
chart. 

For both high and low estimates the geometric measure, 
Sioe y> stands between the arithmetic measure, S u , and the 
harmonic measure, Si. While the two latter have their 

V 

particular functions, and are appropriate in certain cases, 
it is probably true that in using such methods as these in 
economic analysis, measures of the geometric family are 
more generally useful than those of the other types. This 
means, merely, that ratios are usually more important 
than absolute differences. It seems reasonable therefore 
to base estimates upon an equation of the type 

Log Y — f{X) 

and to measure the reliability of these estimates in terms 
of logarithms or ratios, using S\ og „ or S r . In such cases, as 
we have seen, correlation is measured by p\ og „ i og * or p ]og 
The value of this index depends upon the ratio variability 
about the curve, as compared with the ratio variability 
about the geometric mean. 1 

1 The reasoning in C. M. Walsh's book, The Problem of Estimation (London, 
King, 1921, p. 12.) is peculiarly applicable to the present problem. Citing Gali- 
leo, in defence of the use of the geometric mean in averaging estimates, Walsh 
writes: “And so errors must be measured by an error which is a ratio between 
the estimate and the true quantity , and not a concrete quantity itself . We cannot 
measure errors by so many pounds, feet or crowns; we must measure them 
by the proportions of the pounds, feet or crowns in the erroneous estimates to 
the pounds, feet or crowns in the thing estimated.” (Italics mine.) This ar- 
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CHAPTER XVIII 


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING, CONCLUDED 

The methods of induction discussed in an earlier section 
(Chapter XIV) dealt with the more familiar procedures 
employed in generalizing results secured from the study 
of samples. Certain research problems call for modifications 
of the methods there described, while for some purposes 
quite different instruments are needed. In the present 
chapter, therefore, we carry forward the discussion of 
statistical inference, considering methods appropriate to 
certain special conditions and special problems. 

Generalizing from Small Samples 

The standard error of an arithmetic mean, we have seen, 
is given by 

aM = v^ 

where N is the number of observations in the sample and 
<r is the standard deviation of the population from which 
the sample is drawn. We do not know the standard deviation 
of the population but we approximate it from the standard 
deviation of the sample. (For convenience in this exposition 
we shall use s as a symbol for the standard deviation of 
the sample; <t will denote the standard deviation of the 
population.) This is an acceptable approximation when 
N is reasonably large, say 30 or more. But for small values 
of N the standard deviation of the sample is subject to 
a definite bias, tending to make it consistently lower than 
the standard deviation of the population. The value of 
a u derived by the customary method is also biased down- 

698 
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ward. Therefore, when methods appropriate to large 
samples are employed with small samples, we consistently 
under-estimate the sampling errors to which our measure- 
ments are subject. This bias shows remarkable consistency, 
however. With samples of any stated size the magnitude 
of the error to be expected from the use of the standard devia- 
tion of the sample as an approximation to the standard devia- 
tion of the population may be determined, and correction 
made for it. Accordingly, generalization of results secured 
from small samples is possible. In the nature of things the 
margin of error in such generalization is larger than it is when 
large samples are used, but the distortion due to sheer 
bias may be avoided . 1 

The nature of the error involved in generalizing from 
small samples may be brought out in the following terms. 
If we represent by M the mean of the population from 
which a sample is drawn, by X the mean of a single sample, 
and by a- the standard deviation of a distribution of a 
number of X’s computed from successive samples, we may 
write 

T _ A - M 

O- 

X 

The quantity T is the deviation of the mean of the sample 
from the mean of the population, expressed in units of the 
standard deviation of the sample means. When a- is 
determined from the actual distribution of a number of 
X’s, or from the true standard deviation of the population 
and N of the sample, the quantity T may be interpreted 
as a normal deviate. The significance of given values of 
T may then be determined with reference to a table of 
areas under the normal_ curve. Actually, we do not 
have a large number of X’s, which may be arranged in a 
frequency distribution, nor do we know the value of <r 

1 The bias involved in the use of s as an approximation to cr, for small sam- 
ples, was first discovered by “Student.” For the original memoir see “The 
Probable Error of the Mean,” Biomelrika> Vol. 6, 1908, 1-25. 
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(the standard deviation of the population), nor of a- (the 
standard error of X). We approximate a by s (the standard 
deviation of the sample) and <r- x by what we may call s* 

^s x = ~ if « has been computed from > s * ~ 

if s has been computed from When these ap- 


proximations are based upon small samples, the T derived 
from them may not be interpreted as a normal deviate. For 
the distribution of T varies with the size of the sample. With 
small samples the distribution departs significantly from the 
normal type. Statistical inferences that fail to take account 
of this are inaccurate. 

A dis cussion in detail of the distributions of statistical 
measurements obtained from small samples would carry 
us beyond the scope of the present book. We may briefly 
note, however, certain characteristics of the distribution 
function of the standard deviation. These are effectively 
revealed by the results of an interesting experiment con- 
ducted by W. A. Shewhart. 

Shewhart drew 1,000 samples, each consisting of four 
observations, from a normally distributed parent population 
with a known standard deviation, equal to unity. 1 The 
standard deviation, s, of each sample was computed. The 
distribution of these thousand values of s is represented by 
the dots in Fig. 91. 2 (The line running through the dots 
defines the theoretical distribution of s’s to be expected, 
with samples of 4, on the basis of “Student’s” theory. 
There is a notably close agreement between the theoretical 
and observed distributions.) Traditional sampling concepts 
would lead us to expect a normal distribution of s’s, center- 
ing about 1, the value of <r in the parent population. 
Instead, the distribution is definitely skew, with the meas- 

1 W. A. Shewhart, Economic Control of Quality of Manufactured Product , 
New York, Van Nostrand, 1931, 163-173, 185-186. 

2 The figure is here reproduced with the permission of Dr. Shewhart and his 
publishers. 
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urements clustering about a central tendency well below 
unity. The mode of the thousand values of s here repre- 
sented is, in fact, .717 and the arithmetic mean is .801. 
These s’ s, it will be recalled, represent estimates of <r. 



0.075 0.375 0.675 0.975 1.275 1.575 1.875 

Standard Deviation d 

Fig. 91. — Distribution of Standard Deviations in Samples of Four Drawn 
from a Normal Universe 

There is a clear tendency for such estimates, based on 
samples of four, to understate the true value. 

The symbol T has been used above to define the deviation 
of a statistical measure from some standard or hypothetical 
value, expressed in units of the estimated standard error 
of the measure in question, when the deviation, so expressed, 
could be interpreted as a normal deviate. In the present 
exposition we shall employ the symbol t to relate to approxi- 
mations to T when these approximations are based on 
small samples. 

The difference between T and t may be reduced to more 
definite terms. If we let a; = X — M, we may write 
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We may derive t from T: 

t= — 4 - i? = 

0 * ’ <T X 

The normally distributed quantity, xl<r- x , has been divided 
by the factor Si/<r- x , to give the quantity t. Opportunity to 
correct for the bias is given us, however, by the fact that 
the distribution of s x /a x is known. Thus the probability 
corresponding to any stated value of t may be determined 
(when t defines a departure from a certain hypothetical 
value, measured in units of s*). 1 

It is of some interest to compare values of t corresponding 
to stated probabilities, for samples of varying sizes, with 
values of T corresponding to the same probabilities. This 
is done in Table 137. 

The familiar values given in the customary table of 
areas under the normal curve appear on the last line of 

1 The degree of error involved in using s as an approximation to <r, for small 
samples, is indicated by the following figures, taken from W. A. Shewhart 
( loc . tit ., 185). They define the relation between the modal s, for samples of 
size N drawn from a population of which the standard deviation is known, and 
the true a of that population. 

Size of sample Modal s as a decimal fraction of true a 

N 


3 

.577 

4 

.707 

5 

.775 

6 

.817 

7 

.845 

8 

.866 

9 

.882 

10 

.894 

15 

.931 

20 

.949 

25 

.959 

30 

.966 

50 

.980 

100 

.990 


The fractions given above define relations that are to be expected on the 
basis of error theory, as modified by “Student” to take account of conditions 
affecting small samples. The modal value of the 1,000 standard deviations 
obtained by Shewhart in his empirical test of this theory was, as we have seen, 
. 717 of the standard deviation of the universe. This result is very close indeed 
to the expected value of . 707, for samples in which N = 4. 
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Table 137 

Values of t and T Corresponding to Stated Probabilities 1 


n Probability 



.80 

.50 

.40 

.20 

.10 

.05 

.01 

1 

.326 

1.000 

1.376 

3.078 

6.314 

12.706 

63.657 

2 

.289 

.816 

1.061 

1.886 

2.920 

4.303 

9.925 

3 

.277 

.765 

.978 

1.638 

2.353 

3.182 

5.841 

4 

.271 

.741 

.941 

1.533 

2.132 

2.776 

4.604 

5 

.267 

.727 

.920 

1.476 

2.015 

2.571 

4.032 

6 

.265 

.718 

.906 

1.440 

1.943 

2.447 

3.707 

7 

.263 

.711 

.896 

1.415 

1.895 

2.365 

3.499 

8 

.262 

.706 

.889 

1.397 

1.860 

2.306 

3.355 

9 

.261 

.703 

.883 

1.383 

1.833 

2.262 

3.250 

10 

.260 

.700 

.879 

1.372 

1.812 

2.228 

3.169 

20 

.257 

.687 

.860 

1.325 

1.725 

2.086 

2.845 

30 

.256 

.683 

.854 

1.310 

1.697 

2.042 

2.750 

00 

.25335 

.67449 

.84162 

1.28155 

1.64485 

1.95996 

2.57582 


Table 137, for n = « . These are the values of T, as a nor- 
mal deviate, corresponding to probabilities of .80, .50, etc. 
Thus, when we are dealing with infinitely large samples, 
the probability of a given sample yielding a value of T 
as great as .25335 or greater (either above or below the 
mean) is .80. (The area between the maximum ordinate 
and an ordinate erected at + .25335 is 10 per cent of the 
total area under the normal curve. Twenty per cent of 
the total area will fall within ± .25335, and 80 per cent 
will fall beyond these limits.) Similarly, just 50 per cent 
of the values of T will exceed the limits ± .67449 ; 5 per 
cent will exceed the limits ± 1.95996; 1 per cent will 
exceed the limits ± 2.57582. 

As n grows smaller each of these limits must be extended, 
if the probabilities are to remain constant. For samples 
in which n is equal to 10, 50 per cent of the values of t will 

1 The entries in this table are extracts from a more detailed table (Table IV) 
in R. A. Fisher’s Statistical Methods for Research Workers r Edinburgh, Oliver 
and Boyd, sixth edition, 1936. The table is printed here through the courtesy 
of Dr. Fisher and his publishers. 
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fall beyond the limits ± .700; 5 per cent will exceed the 
limits ± 2 . 228, and 1 per cent will exceed the limits 
± 3 . 169. (The letter n in Table 137 refers to the number 
of degrees of freedom in the computation of t. This general 
concept has been discussed in Chapter XV. When the 
arithmetic mean of a sample is being tested for significance, 
n = N — 1.) If in applying various statistical tests we attach 
significance to a given level of probabilities, such as 5/100 
or 1/100, we must recognize that the values of t corre- 
sponding to these probabilities vary with n. Fortunately, we 
now know how these values vary and, using such a table as 
that given above, may make allowance for the variation. 

For convenience in exposition we have distinguished T, 
as a normal deviate, from t, a similar deviate relating to a 
distribution of quantities derived from small samples, and 
therefore not normal. The probabilities corresponding to 
a given value of T are not the same as the probabilities 
corresponding to an identical value of t. Indeed, these 
probabilities vary for the same value of t computed from 
samples of different sizes. The distinction between T and 
t need not be preserved, however. We may use t generally 
to define the deviation of a statistical measure from some 
standard or hypothetical value, expressed in units of the 
standard error of the measure in question. The quantity 
t is to be interpreted as a normal deviate when large samples 
are dealt with. The interpretation is modified in dealing 
with small samples, as we have seen. The nature of the 
modification required is shown by the entries in Table 137 
and in Appendix Table II. 

EXAMPLES OF TESTS BASED ON i-TABLE 

In determining whether the mean of a sample deviates 
significantly from any stated value we may compute t 
from the relation 

. X — M 
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where X is the mean of the sample, M is the stated value 
and Si is an approximation to the standard error of X. 
For this approximation we have 


s 

Sz — 

VN 


where s is the standard deviation of the sample (here com- 


puted from 



The value t, which for larger sam- 


ples we have interpreted with reference to a table of areas 
under the normal curve, we here interpret with reference 
to the special stable for small samples. In using the stable 
for this purpose we take n of that table as equal to N — 1. 

For the six New England states the average earnings 
of factory workers in 1935/ as indicated by census returns, 
were as follows: 


Maine 

$ 851 

New Hampshire 

892 

Vermont 

940 

Massachusetts 

1,007 

Rhode Island 

938 

Connecticut 

1,016 

Average 

$ 940.67 


For s we obtain the figure $63.99. 

s $63 . 99 


the mean is s* 


The standard error of 
= $26.13. 


Vn Vq 

Does the average of annual earnings of factory workers 
in the six New England states differ significantly from 
$1,022, the average for the country as a whole? Computing 
t we have 

x-M $ 940 . 67 -$ 1,022 


t = 


$ 26.13 


= -3.11. 


1 These averages, and similar ones cited below, are derived by dividing the 
total wages paid by the average number of wage-earners employed during the 
year. Part-time workers are included. The averages do not represent full-time 
earnings, therefore. 
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Consulting the stable with n = 5 we find that for P = . 01, 
2 = 4.032. The observed deviation is not as great as this. 
If our standard is a P of .01, the average for the New 
England states is not to be judged significantly less than 
the average for the country as a whole. If the standard 
were a P of . 05, however, the deviation would be considered 
significant. 

Similarly, we may test with reference to the 2-table the 
significance of a difference between two means, computed 
from small samples. In this case we obtain 2 from the 
relation 

, _ X - X* t / NJf* 
s V N, + N t 

where the X’s and N’s have the customary meanings, and 
s is, in effect, an average standard deviation of the two 
distributions. For 


s 



Srfi 2 + 2 d 2 2 
Nx + N 2 - 2 


Here d x and d 2 are used, respectively, to denote deviations 
of given observations from the means of the two distribu- 
tions. The value 2, as derived above, corresponds to 2 — — , 

CD 

where D is the difference between two means and <r D is 
the standard error of that difference. For small samples, 
however, the customary formula for a D is modified some- 
what, and the special 2-table rather than the table of normal 
deviates is used. In consulting the 2-table in a problem of 
this type, n is taken as equal to Ni + N 2 — 2. 

Average earnings of workers employed in manufacturing 
plants in six Southern states, in 1935, are shown below: 


North Carolina 

$662 

South Carolina 

615 

Georgia 

599 

Tennessee 

744 

Alabama 

640 

Mississippi 

541 

Average 

$633.50 
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Does this average differ significantly from the mean earnings 
in six New England states in the same year? For the com- 
putation of s we have 

and for t 

_ $940.67 - $633.5 /36 

$66.06 V 12 

= 8.05. 

In the f-table, for n = 10, we find that the value of t corre- 
sponding to a P of . 01 is 3 . 169. The present value is clearly 
significant. The two samples could not have come from 
one homogeneous parent population. 

The f-table has particular value in connection with the 
interpretation of coefficients of regression. We may have 
observed that a given variable, Y, appears to increase by 
a constant increment or at a constant rate as another 
variable, x, changes in value. The degree of relationship 
between the two variables may be measured in terms of r, 
the coefficient of correlation, but special interest often 
attaches to the functional relationship and, in particular 
to the apparent regression of y on x. Does b of the equation 
of regression 

Y = a + bX 

depart significantly from zero, or from some other value 
which has significance for the purpose in mind? Here we 
must judge b with reference to the sampling errors to which 
it is exposed. 

A general test of this type was applied in an earlier 
section (Chapter XIV), in seeking to determine whether 
average corn yield in Kansas had shown a significant decline 
over the period 1890-1933. For smaller samples we may 
compute t by exactly the methods there presented, but we 
should interpret t with reference to the special stable 
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adapted to small samples. As a general formula we have 

t = b -^ 

where b is a coefficient of regression and /3 is a norm with 
reference to which we wish to judge the given value of b. 
For the standard error of b we have 


where 


<Tb 


s v 

Vzi? 




Y c )« 
2 


(In these expressions, x = X — X, Y is an observed value 
of the dependent variable, and Y c is the corresponding 
computed value.) In interpreting the value of t thus secured, 
the i-table is employed with n — N — 2. 

This test may be extended to the comparison of two 
coefficients of regression. The series in Table 138 provide 
an illustration. 


Table 138 

Aggregate Values of Loans on Securities and Commercial Loans , 
Reporting Member Banks, Federal 
Reserve System, 1922-1929 
(In hundreds of millions of dollars) 


Year 

Loans on 

Commercial loans 

securities 

(“all other loans”) 

1922 

39 

73 

1923 

41 

78 

1924 

45 

80 

1925 

53 

82 

1926 

57 

86 

1927 

62 

87 

1928 

69 

89 

1929 

77 

92 


For loans on securities the trend (i.e., the equation of 
regression of volume of loans on time) is defined by 
Vi = 30.63 + 5.49-Xj. 
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The corresponding equation for commercial loans is 
F 2 = 72.13 + 2.54X 2 . 

In each case the origin is at 1921. The eight-year period 
was marked by an increase of loans on securities which was 
much more rapid than the corresponding advance in com- 
mercial loans. We must ask, however, whether the difference 
between the two coefficients of regression is really signifi- 
cant, if account be taken of sampling fluctuations. 

The coefficients to be compared are 

bi = 5.49 
h = 2.54. 

In testing whether 6i — b 2 is significant (i.e., deviates 
significantly from zero) we must compute 



(Tbi — 


<r tl _ t, being, of course, the standard error of the difference 
between the two coefficients of regression. For this standard 
error we have 

_ l / SI ■ Sy 2 ~ 

h - k ' ~ y s(xi s ) 

where x x and x 2 are given values of the two variables, 
expressed as deviations from their respective arithmetic 
means, and 

„ , 2(Fi - Fei) 2 + 2(F 2 - Y C ,Y 
* v ~ Ai + A 2 -4 

S y 2 is a measure of the average scatter about the two lines 
of regression. 

In the present example we have 

-V 

(Tbi-b, 

t 
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For the interpretation of this value of t we enter the stable 
with n = Ni + N 2 — 4 = 12. In this case the value of t 
far exceeds the value of 3.055, corresponding to P = .01. 
The results are not consistent with the hypothesis that 
the true value of bi — is zero. The trends of the two 
series differ significantly. (Here, again, the reader should 
bear in mind that such tests of significance apply only 
with important qualifications to economic series that are 
ordered in time.) 

Sampling Errors of Coefficients of Correlation 
Computed from Small Samples 

As a general formula for the determination of the standard 
error of the coefficient of correlation we have made use of 

1 — r 2 

“ VN- l' 

In error theory, the r that appears in the numerator of 
the right-hand member of this equation is the coefficient 
of correlation in the universe from which the sample in 
question is drawn. But this r is not known. Our best 
approximation to it is the r derived from the sample. Here, 
again, we face distortion in small samples, a distortion 
that is the greater the higher the value of the true correla- 
tion. The nature of this bias may be readily understood. 
If we are drawing samples from a universe in which the 
true value of r is + .95, the range of the possible variation 
of the sample r’s above the true r is only .05. But the 
range of possible variation below the true value is 1.95 
(i.e., from + .95 to — 1.00). Accordingly, a distribution 
of r’s obtained from a great many small samples from this 
universe will be sharply skew. An estimate of the true 
value based upon a sample value will be subject to corre- 
sponding bias. This bias will not be present when the 
population value of r is zero. (The distribution of sample 
r’s when the population value of r is zero will be symmetrical 



SMALL SAMPLES 


611 


but will depart somewhat from the normal type in other 
respects.) It will not be pronounced when the samples 
are large, even for high values of r. But when samples 
are small and the population value of r departs materially 
from zero, substantial inaccuracy results from the use of 
the formula given above. 

Allowance may be made for this bias by use of the table 
showing the distribution of t, for samples of various sizes. 
R. A. Fisher has shown that the procedure employed in 
deriving f, in testing whether a coefficient of linear regression 
differs significantly from zero, may be used, with an algebraic 
modification of the mathematical expression, in determining 
the significance of r. If we are testing the hypothesis that 
a sample from which a given r has been computed was drawn 
from a population in which the true value of r is zero, we 
may compute t from the relation 

2 

Vl — r 2 

This is equivalent, of course, to dividing the quantity r — 0 
(i.e., the deviation of the given r from the hypothetical 
value of zero) by Vl — r 2 /V/V — 2. In consulting the 
f-table for the interpretation of the values thus obtained, n, 
the number of degrees of freedom, is taken as equal to A — 2. 

As an illustration, we may test the results obtained from 
a study of the relation between the production and the 
price of cotton in the United States, covering 35 observa- 
tions. The value of r is — . 65. We have 


- .65V35 - 2 

Vl -(- . 65p 


4.91. 


In consulting the <-table we find that for n = 33 the value 
of t corresponding to a probability of 1 per cent 1 is approxi- 


1 This probability refers to the likelihood of deviations above or below the 
assumed true value of zero. It corresponds to the sum of areas at both extrem- 
ities of a frequency curve. We may divide it by two to obtain the probability 
of a deviation of the stated magnitude in one direction only from the hypothet- 
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Table 139 


Values of the Correlation Coefficient for Different Levels of 
Significance 1 


n 

P = .05 

P = .02 

P = .01 

1 

.996917 

.9995066 

.9998766 

2 

.95000 

.98000 

.990000 

3 

.8783 

.93433 

.95873 

4 

.8114 

.8822 

.91720 

5 

.7545 

.8329 

.8745 

6 

.7067 

.7887 

.8343 

7 

.6664 

.7498 

.7977 

8 

.6319 

.7155 

.7646 

9 

.6021 

.6851 

.7348 

10 

.5760 

.6581 

.7079 

11 

.5529 

.6339 

.6835 

12 

.5324 

.6120 

.6614 

13 

.5139 

.5923 

.6411 

14 

.4973 

.5742 

.6226 

15 

.4821 

.5577 

.6055 

16 

.4683 

.5425 

.5897 

17 

.4555 

.5285 

.5751 

18 

.4438 

.5155 

.5614 

19 

.4329 

.5034 

.5487 

20 

.4227 

.4921 

.5368 

25 

.3809 

.4451 

.4869 

30 

.3494 

.4093 

.4487 

35 

.3246 

.3810 

.4182 

40 

.3044 

.3578 

.3932 

45 

.2875 

.3384 

.3721 

50 

.2732 

.3218 

.3541 

60 

.2500 

.2948 

.3248 

70 

.2319 

.2737 

.3017 

80 

.2172 

.2565 

.2830 

90 

.2050 

.2422 

.2673 

100 

.1946 

.2301 

.2540 


ical value. In most problems of the type here discussed it is conservative prac- 
tice to test given results with reference to the probability of a deviation of 
given magnitude, without consideration of the direction of deviation. The 
tabulated values of t lend themselves to this procedure. 

1 This table is printed here through the courtesy of R. A. Fisher and his 
publishers, Oliver and Boyd, of Edinburgh. The original appears as Table V.A 
of Statistical Methods for Research Workers. 
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mately 2.73. If the true value of t were zero, a value as 
great as 2.73 or greater would occur only 1 time out of 
100, as a result of chance fluctuations of sampling. The 
present value of t is substantially greater than 2.73. It 
it highly improbable that it reflects a chance drawing from 
a population in which the true value of t (and, of course, 
of r) is zero. The results we have obtained are not, then, 
consistent with the hypothesis that the true value of r is 
zero. There appears to be a significant negative correlation 
between the production and the price of cotton. 

If we are seeking to determine the significance of given 
coefficients of correlation with reference to hypothetical 
values of zero, use may be made of a table prepared by 
R. A. Fisher, showing the values of correlation coefficients 
at stated levels of significance. Selected values from this 
table are given in Table 139 and in Appendix Table III. In 
simple correlation problems, this is to be read with n equal 
to N - 2 (the numoer of pairs of original observations 
less 2). In determining the significance of coefficients of 
partial correlation the number of variables held constant 
is also subtracted from N. 

The use of the table requires little explanation. If a 
sample is based on 12 pairs of observations, with n equal 
to 10, we would require a coefficient at least as high as 
.7079 before we accept it as significant, if our standard 
of significance is P = .01. For only 1 time out of 100 
trials would a sample of 12 drawn from an uncorrelated 
population yield a value of r as great as .7079. If our 
standard of significance is P = .05 we would accept as 
significant of a real relationship an r of .5760, or greater, 
obtained from a sample of 12. 


TRANSFORMATION OF T TO Z 

The sampling limitations attaching to r have led R. A. Fisher 
to utilize as a general measure of linear correlation a loga- 
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rithmic function of r that possesses certain distinctive merits. 1 
In effecting the transformation we have 

2 = | {log«(l + r) - log„(l - r)}. 

Conversely 

r = (c J * - 1) 4- (e*‘ + 1). 

The scales of possible values of r and z are, of course, quite 
different. For r = 0, z = 0, and for r = 1, z - <». Nega- 
tive values of r give negative values of z. The relations 
between the two functions, at different levels of correlation, 
are shown by the entries in Appendix Table IV. Transfor- 
mation may be more readily effected by means of this 
table than from the relations given above. 

There are certain highly important advantages in this 
transformation. Not least is the replacing of r by a function 
with a distribution of values corresponding more closely 
to the true significance of observed correlations than do 
those of r. Thus a change in the value of r from .88 to 
.98 is equivalent, on the r scale, to a change from .20 to 
.30. But the first of these differences represents, on the 
z scale, a change from 1.38 to 2.30 (a range of . 92) while 
the second represents a change in z from .20 to .31 (a 
range of .11). The first difference, on the z scale, is over 
8 times more significant than the second. In this the z 
scale gives a far more accurate representation of the true 
significance of observed correlations than does the r scale. 

More important than this, however, is the fact that the 
distribution of z is much closer to the normal type than is 
that of r; in particular, the distribution of z is not subject, 
as is that of r, to marked variations in form with variations 
in the degree of correlation in the population. The form 
of the distribution of z is virtually independent of the 
degree of correlation. As a result, the sampling errors to 
which z is exposed may be estimated with considerable 

1 See Statistical Methods for Research Workers, Chapter VI. 
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accuracy. For the standard error of z we have 

<7 ' = 


This standard error, it is to be noted, is a function solely 
of N. It is independent of the true value of z in the parent 
population. 

From the example in Chapter XVI we obtained a coeffi- 
cient of partial correlation of — .2923 between corn yield 
per acre in Kansas and average June temperature, holding 
constant effects of changes in July and August tempera- 
tures. Referring to Appendix Table IV we have, for 
r = — .2923, z = — .301. In computing the standard 
error of a coefficient of partial correlation we must subtract 
from N the number of variables held constant. Since N 
equals 44 in the example in question, we treat the coefficient 
of partial correlation as we would a simple coefficient based 
on 42 observations. For the standard error of z we have, 
then, 


1 

<r ' V42^3 


.160. 


With reference to this result we may determine whether z 
differs significantly from zero. For the test we must have 


z — 0 — .301 

<r, .160 


1 . 88 . 


We interpret 1.88 as a normal deviate. It is clear that it 
is not large enough to indicate that z is significant. The 
result is not inconsistent with the hypothesis that the true 
value of z (and hence of r) is zero. 

If, however, we test the coefficient = — .4057, from 
the same example (defining the relation between corn yield 
per acre and August temperature, with June and July tem- 
peratures held constant), we have 


z — 0 — .430 

( t. . .160 


2.69. 



616 STATISTICAL INDUCTION 

This result is clearly significant. So, also, is the measure 
rw. 24 = — .6101, the coefficient of partial correlation be- 
tween corn yield and July temperature, with June and 
August temperatures held constant. 

The procedure would be similar, of course, if we were 
testing the significance of the deviation of an observed 
value of z from a theoretical value other than zero. 

The transformation to z makes possible, also, an accurate 
test of the significance of the difference between two 
observed correlations. The standard error of the difference 
between two values of z is given by 

<r D . = |/ Ni _ 3 + # 2 _ 3 

where Ni is the number of pairs of observations in the first 
sample, N 2 the number in the second. 

This test may be illustrated with reference to observations 
on the timing of price changes during business cycles. 
For 111 commodities we have observations on the timing 
of price declines in two successive periods of business 
recession occurring in the late 90’s and early 1900’s. The 
degree of relation between the time sequences of commodity 
price changes in these two recessions is indicated by a 
coefficient of correlation of + .22. For two similar (suc- 
cessive) periods in the 1920’s the measure of correlation, 
based on the prices of 121 commodities, has a value of 
+ . 36. There appears to have been a closer approach to a 
common pattern in the later period than in the earlier. 
In testing the significance of the difference between the two 
results we set up the hypothesis that the two samples were 
drawn from the same parent population, and that therefore 
the true value of the difference between the two coefficients 
is zero. 

For the two samples we have 

r - .22;* - .223; ^ i - .0093 
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r = .36; z — 


.377; 


1 

N,- 3 


1 

118 


.0085. 


The difference to be tested is 


D, = .377 - 223 = .154. 


The standard error of this difference is 


a Di = \/.0093 + .0085 = .133. 

We wish to know whether Dz is significantly different from 
zero. We compute, therefore, 

D z - 0 .154 - 0 , 1fl 

= — r6o — = l • 16. 

•r Dz . 133 

Interpreting 1.16 as a normal deviate, we conclude that 
the difference is not significant. Dz differs from the hypothet- 
ical value of zero by only slightly more than one standard 
deviation. The results are not inconsistent with the 
hypothesis that the two samples are drawings from the 
same parent population. There is here no clear evidence 
that the degree of relationship between price movements 
in successive cycles was closer in the 1920’s than in the 
earlier period. 1 

Finally, making use of the z-transformation, we may 
combine results secured from the measurement of corre- 
lation in different samples. If we have two values of r, 
obtained from samples drawn from the same population, 
a weighted average of the two will provide a better estimate 
of the true correlation than will either of the r’s, taken 
separately. For the averaging process we transform the 
r’s to z’s, weight each z by the corresponding N, less 3, 
and average them. Then, if desirable, the corresponding 
value of r may be determined. We may note that the 

1 The time factor enters to cloud statistical inductions relating to samples 
drawn from different periods (see above, Chapter XIV). Such an induction 
should be supported by evidence indicating that fundamental conditions in the 
field in question have not been altered over the time interval involved. This 
caution does not, of course, affect the procedure illustrated above. 
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standard error of the weighted average of the two z’s is 
given by 

V Wi - 3) + W» ~ 3)‘ 

The Chi-Square Test 

One of the great contributions of Karl Pearson to statisti- 
cal methodology was the determination of the form of 
the distribution of Chi-square, and the development of 
methods of utilizing this distribution. The character of 
this distribution and various tests based on it are our 
concern in the present section. 

THE NATURE OF CHI-SQUARE AND ITS DISTRIBUTION 

The quantity Chi-square (represented always by the 
symbol x 2 ) is a measure of the degree to which a series of 
observed frequencies deviate from corresponding theoretical 
or hypothetical frequencies. The theoretical frequencies 
are set up on the basis of some hypothesis, some rational 
argument. The magnitude of the discrepancy between 
theory and observation is defined by the quantity x 2 - It 
was Pearson’s contribution to determine the nature of the 
distribution of the values of x 2 that would be obtained 
under given sampling conditions. Knowledge of this dis- 
tribution enables us to determine whether a given discrep- 
ancy between theory and observation may be attributed 
to chance, or whether it results from the inadequacy of 
the theory to fit the observed facts. This instrument is 
obviously one of extreme importance in statistical analysis. 

The character of the distribution of x 2 may be discussed 
with reference to Weldon’s data relating to the results 
obtained in 4,096 throws of 12 dice (see page 433). We 
call a 4, 5, or 6 spot a success, a 1, 2, or 3 spot a failure. 
When 12 dice are thrown the expected (or theoretical) 
number of successes on each throw is 6. A deviation from 
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6 represents a discrepancy between expectation and observa- 
tion. From the result of each throw of 12 dice a value of x* 
may be computed. Thus, a given throw yields 2 successes 
and 10 failures. The 2 successes represent a deviation of 4 
from the expected value of 6; the 10 failures represent a 
deviation of 4 from the expected value of 6. (In such an 
experiment as this there are two components of each value 
of x 2 > even though when one component is given the other 
is necessarily determined. For the sum of successes and 
failures must be 12 on each throw.) The value of x 2 in 
a given instance is obtained by squaring the discrepancies 
between expectation and observation, dividing the squared 
values by the corresponding expected values, and adding 
the quantities thus obtained. That is 

X 2 = z{- fo -y ^ -} 

where / 0 denotes an observed frequency and / defines the 
corresponding theoretical frequency. 

In the case cited above we have 


X 


2 


(2 - 6) 2 (10 - 6) 2 _ 

6 + 6 “ 5m 


On another trial, with 7 successes and 5 failures, we have 


(7 - 6) 2 (5 - 6) 2 

X a • a 


.333. 


On still another trial, giving 6 successes and 6 failures, we 
have 


r . - (6 ~ 6) 2 ■ (6 - 6) 2 
A A ' A ’ 


The 4,096 throws thus yield 4,096 values of x 2 - Tabulating 
these with respect to the frequency of occurrence of stated 
values, we obtain the distribution given in Table 140 on 
page 620. 

This table gives us information as to the nature of the 
discrepancies between theoretical norms and actual results 
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Table 140 


Tabulation of 4,096 Observed Values of x 2 


Value 1 of V 2 

(measuring deviation of 

Frequency of 

Frequency of 

observation from expec- 

occurrence 

occurrence 

tancy in dice-throwing 

(absolute) 

(relative) 

experiment ) 

0to .833 

2,526 

966 

.6167 

.833 to 2.167 

.2358 

2.167 to 4.167 

455 

.1111 

4. 167 to 6.667 

131 

.0320 

Over 6 . 667 

18 

.0044 

Total 

4,096 

1.0000 


that chance may bring about. For deviations from the 
expected frequency of successes, 6, may be attributed to 
the mass of undifferentiated causes we call chance. The 
magnitude of x 2 varies, of course, with the degree of devia- 
tion. Values of x 2 not exceeding .833 are most frequent. 
Higher values of x 2 occur with decreasing frequency. Only 
18 out of 4,096 observed values of x 2 exceed 6.667. This 
distribution furnishes us, therefore, with a standard of 
reference to employ when seeking to determine whether 
a given discrepancy between theoretical and observed values 
is attributable to chance, or whether it is too great to be 
so explained. 

T his use of the table, as a standard for determining 
the probability that given discrepancies between theory 
and observation are attributable to the play of chance, 
is facilitated by a somewhat different arrangement. We 
may set up a table of cumulative values, based upon the 

1 The 4,096 values of x 2 tabulated here constitute a discrete series. Th& 
conditions of the experiment are such that the 4,096 observations on x 2 are 
distributed among only seven values, ranging from 0 to 12. In order that the 
observed frequencies of occurrence of stated values of X 2 may be compared (in 
a later table) with theoretical frequencies, an uneven class-interval is em- 
ployed above. Class limits are taken midway between successive values at 
which the actual observations fall. (The decimal fractions used in the table do 
not define these limits with full accuracy.) 
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tabulation of the 4,096 values of x 2 obtained in the preceding 
experiment. These are given in Table 141. 

The entries in col. (2) of this table indicate that in the 
experiment involving 4,096 throws of dice, a value of x 2 
of 6.667 or more occurs less frequently than 1 time out 
of 100 (only 44 times out of 10,000, in fact). A value as 
great as 4.167, however, occurred more frequently than 
3 times out of 100. If we interpret these relative frequencies 
as probabilities, we may obtain from such a table a knowl- 
edge of the probabilities corresponding to stated values 
of x 2 - Here is the instrument we desire, in seeking to deter- 
mine whether given observations conform closely enough 
with expectations based on theory, or on working hypotheses 
which perhaps are not yet ready to be dignified as theories. 


Table 141 


Cumulative Relative Frequencies of Occurrence of 4,096 Observed 
Values of x 2 , with Corresponding Theoretical Frequencies 1 


( 1 ) 

Value of x 2 
( cumulative deviation 
of observation 
from expectancy) 

0 or more 
. 833 or more 

2. 167 or more 

4. 167 or more 
6.667 or more 


( 2 ) 

Relative frequency 
of occurrence 

(observed) 

1.0000 

.3833 

.1475 

.0364 

.0044 


(3) 

Relative frequency 
of occurrence 
(theoretical) 

1.0000 

.3613 

.1411 

.0412 

.0098 


We should note two important limitations attaching to 
the entries in col. (2) of the above table, showing relative 
frequencies corresponding to stated values of x 2 * In the 
first place, these are merely empirical results, obtained 
from a given set of experiments. The conditions of the 
experiment yield a discontinuous series of values for x 2 - 
In some degree, this discontinuity has been ironed out by 

1 One degree of freedom is present in the determination of a single value of 
X 2 , in this example 
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the method of classification employed, but the instrument 
derived from this single experiment remains an imperfect 
one. The effects of chance fluctuations are present in 
these results, also, and contribute to the imperfection of 
the instrument. The true distribution of x 2 is only approxi- 
mated by the results presented in col. (2) of Table 141. 

The entries in col. (3) of Table 141 are free of this lim- 
itation. These record the frequencies with which values 
of x 2 falling within the limits indicated in col. (1) might 
be expected to occur, on the basis of mathematical theory, 
under the conditions of the present experiment. 1 These are 
the entries which provide the standard we desire, in deter- 
mining the significance of a given series of discrepancies 
between observation and expectation. It is to be noted, 
however, that the empirically derived table constitutes a 
fair approximation to the theoretical distribution of x 2 
under these conditions. 

The second limitation attaching to the example cited 
above is that each of the 4,096 values of x 2 tabulated has 
two components, and that the experiment is such that 
when one component is given the second is necessarily 
determined. (Since there are 12 events in each throw we 
know, for example, that if we have 8 successes there must 
be 4 failures.) This condition is described by saying that 
there is but one degree of freedom in the derivation of 
a given value of x 2 - The table we have obtained relates, 
therefore, to a special case — the distribution of values 
of x 2 computed with one degree of freedom. There are 
other possible cases. For each of these the distribution of x 2 
may be determined in a manner similar to that shown above. 

As an example of a different set of conditions we may 
consider the outcome of a throw of 24 dice, account being 
kept of the frequency of occurrence of each possible result 

1 These relative frequencies are taken from G. Udney Yule “Table of the 
values of P for divergence from independence in the fourfold table,” Journal 
of the Royal Statistical Society y Vol. LXXXV, January, 1922, 103-104. 
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(i.e., the appearance of a 1, 2, 3, 4, 5, or 6 spot). When 24 
dice are thrown there may be expected 4 one spots, 4 
two spots, 4 three spots, etc. In a given throw we obtain 


the following results: 

1 

2 

Number of spots 

3 4 

5 

6 

Observed frequency 

2 

5 

6 

4 

4 

3 

Expected frequency 

4 

4 

4 

4 

4 

4 


For the results of this throw the value of Chi-square would 
be given by 

(2 - 4) 2 , (5 - 4) 2 (6 - 4) 2 , (4 - 4) 2 , (4 - 4) 2 


a = 


— + + 4 


+ ^ . 4 ^ - = 2.50. 


This quantity has six components. However, as soon as 
five are given the sixth is determined, since the total number 
of events is fixed at twenty-four. There are, then, five 
degrees of freedom in the calculation of x 2 in this experiment. 

If the 24 dice were thrown a thousand times, say, we 
should have one thousand values of x 2 - A distribution 
of these could be constructed, similar to that derived 
empirically for the case in which there was one degree 
of freedom. It would be a different distribution, however, 
for the change in degrees of freedom has an obvious relation 
to the magni tude of x 2 - The character of the distribution 
of the values of x 2 that would be obtained in such an experi- 
ment is indicated by the entries in Table 142 on page 624. 
We do not here give empirical values, as in the preceding 
example. The table shows the theoretical frequencies with 
which given values of x 2 occur, when five degrees of freedom 
prevail. 

In using tables of this sort we may interpret meas- 
ures of relative frequency as probabilities. Thus we may 
read Table 142, which relates to the distribution of x 2 
computed with five degrees of freedom, as follows: If 
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Table 142 

Tabulation of x 2 Computed with Five Degrees of Freedom, with 
Cumulative Relative Frequencies 1 


Value of x 2 

Relative frequency 
of occurrence 

0 or more 

(i theoretical ) 

1.0000 

1 or more 

.9626 

2 or more 

.8491 

3 or more 

.7000 

4 or more 

.5494 

5 or more 

.4159 

6 or more 

.3062 

7 or more 

.2206 

8 or more 

.1562 

9 or more 

.1091 

10 or more 

.0752 

11 or more 

.0514 

12 or more 

.0348 

13 or more 

.0234 

14 or more 

.0156 

15 or more 

.0104 

16 or more 

.0068 

30 or more 

.000015 

00 

.000000 


the true value of x 2 is zero (i.e., in an infinitely large sample 
observed frequencies would agree precisely with the theo- 
retical frequencies we have set up), the probability of our 
seeming a x 2 of zero ° r more, from a sample of the type 
here employed, is 1.00; the probability of our securing 
a x 2 of 1.00 or more is 9,626/10,000; the probability of 
our securing ax 2 of 3.00 or more is 7/10; the probability 
of our seeming a x 2 infinitely large is 0. The quantities 
x 2 and P stand, thus, in a definite functional relationship, 
for any given value of n (n denotes the number of degrees 
of freedom). At the two limits the relationships are the 

1 From the table prepared by W. P. Elderton and given in Tables for Statis- 
ticians and Biometricians , Karl Pearson, editor, 26. The n' of Elderton’s table 
is equal to n + 1, for an example of the type here given. 
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Table 143 1 


Table of x 2 I or Selected Values of P and n 


n 

P - .99 

.95 

.50 

.10 

.05 

.02 

.01 

1 

.000157 

.00393 

.455 

2.706 

3.841 

5.412 

6.635 

2 

.0201 

.103 

1.386 

4.605 

5.991 

7.824 

9.210 

3 

.115 

.352 

2.366 

6.251 

7.815 

9.837 

11.341 

4 

.297 

.711 

3.357 

7.779 

9.488 

11.668 

13.277 

5 

.554 

1.145 

4.251 

9.236 

11.070 

13.388 

15.086 

6 

.872 

1.635 

5.348 

10.645 

12.592 

15.033 

16.812 

7 

1.239 

2.167 

6.346 

12.017 

14.067 

16.622 

18.475 

8 

1.646 

2.733 

7.344 

13.362 

15.507 

18.168 

20.090 

9 

2.088 

3.325 

8.343 

14.684 

16.919 

19.679 

21.666 

10 

2.558 

3.940 

9.342 

15.987 

18.307 

21.161 

23.209 

11 

3.053 

4.575 

10.341 

17.275 

19.675 

22.618 

24.725 

12 

3.571 

5.226 

11.340 

18.549 

21.026 

24.054 

26.217 

13 

4.107 

5.892 

12.340 

19.812 

22.362 

25.472 

27.688 

14 

4.660 

6.571 

13.339 

21 064 

23.685 

26.873 

29.141 

15 

5.229 

7.261 

14.339 

22.307 

24.996 

28.259 

30.578 

16 

5.812 

7.962 

15.338 

23.542 

26.296 

29.633 

32.000 

17 

6.408 

8.672 

16.338 

24.769 

27.587 

30.995 

33.409 

18 

7.015 

9 390 

17.338 

25.989 

28.869 

32.346 

34.805 

19 

7.633 

10.117 

18.338 

27.204 

30.144 

33.687 

36.191 

20 

8.260 

10.851 

19.337 

28.412 

31.410 

35.020 

37.566 

21 

8.897 

11.591 

20.337 

29.615 

32.671 

36.343 

38.932 

22 

9.542 

12.338 

21.337 

30.813 

33.924 

37.659 

40.289 

23 

10.196 

13.091 

22.337 

32.007 

35 172 

38.968 

41.638 

24 

10.856 

13.848 

23.337 

33.196 

36.415 

40.270 

42.980 

25 

11.524 

14.611 

24.337 

34.382 

37.652 

41.566 

44.314 

26 

12.198 

15.379 

25.336 

35.563 

38.885 

42.856 

45.642 

27 

12.879 

16.151 

26.336 

36.741 

40.113 

44.140 

46.963 

28 

13.565 

16.928 

27.336 

37.916 

41.337 

45.419 

48.278 

29 

14.256 

17.708 

28.336 

39.087 

42.557 

46.693 

49.588 

30 

14.953 

18.493 

29.336 

40.256 

43.773 

47.962 

50.892 

same for all 

values of 

n. When x 2 : 

= 0, P 

= 1.00; 

when 


X 2 = oo, P =0.00. But for intermediate values the rela- 
tionship varies with n. 

In 1900 Karl Pearson defined the distribution function 


1 This table is reproduced here through the courtesy of R. A. Fisher and hi$ 
publishers, Oliver and Boyd, of Edinburgh. The entries are taken from 
Table III of Statistical Methods for Research Workers . 
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of x 8 * The actual application of the x 2 test is facilitated by 
prepared tables. Selected entries from these tabulations, 
for different values of n, are given in Table 143 on page 625 
and in Appendix Table V. 

For determining the significance of x 2 beyond the range 
of this table, Fisher has given V2x 2 — V2n— 1, as a value 
which may be interpreted as a normal deviate. That is, 
the figure derived when stated values of x 2 and n are inserted 
in the above expression is to be taken as a deviation from 
the mean of a normal distribution, expressed in units of 
the standard deviation of that distribution. The corre- 
sponding value of P is then derived from a table of areas 
under the normal curve. 

The x 2 test is applicable to a considerable variety of 
problems. Wherever, on rational grounds, a set of theoretical 
frequencies may be derived, for comparison with observed 
frequencies, this test is appropriate in judging of the 
significance of the discrepancy between the two sets of 
frequencies. In the following pages three applications of 
this test are exemplified. 

THE CHI-SQUARE TEST OF GOODNESS OF FIT 

When an ideal frequency curve, whether normal or of 
some other type, is fitted to an actual frequency distribution, 
theory and observation are being compared. A test of the 
concordance of the two (i.e., of goodness of fit) may be 
made by inspection, but such a test is obviously inadequate. 
Precision may be secured by employing the x 2 test. The 
example in Table 144, relating to the distribution of tele- 
phone subscribers discussed in Chapter XIII, illustrates the 
procedure. 

There are 15 classes in this distribution. Since the totals 
of columns (2) and (3) must be the same, one degree of free- 
dom is lost. In arbitrarily making the mean and standard 
deviation of the theoretical distribution equal to those of the 
observed distribution, we lose two more degrees of freedom. 
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Table 144 

Computation of x 2 for Testing Goodness of Fit 
Normal Curve of Error Fitted to Distribution of Telephone Subscribers 


(1) 

(2) 

(3) 

(4) 

(5) 

Class 

limits 

Observed 
frequency 
f o 

Theoretical 

frequency 

f 

(fo~f) 

(fo-f)' 

f 

150 and less 

10 

13.14 

- 3.14 

.75 

150-200 

19 

16.76 

+ 2.24 

.30 

200-250 

38 

31.57 

+ 6.43 

1.31 

250-300 

50 

53.02 

- 3.02 

.17 

300-350 

95 

79.43 

+ 15.57 

3.05 

350-400 

85 

106. 10 

- 21.10 

4.20 

400-450 

115 

126.41 

- 11.41 

1.03 

450-500 

132 

134.31 

- 2.31 

.04 

500-550 

144 

125.50 

+ 18.50 

2.73 

550-600 

116 

106.51 

+ 9.49 

.85 

600-650 

79 

81.85 

- 2.85 

.10 

650-700 

54 

55.21 

- 1.21 

.03 

700-750 

31 

33.19 

- 2.19 

.14 

750-800 

11 

17.81 

- 6.81 

2.60 

More than 800 

16 

14.19 

+ 1.81 

.23 


995 

995.00 

15 groups 

X J = 17.53 


In effect, three constants have been employed in the fitting 
process. The given value of x 2 > 17 . 53, is determined, there- 
fore, with 12 degrees of freedom (the number of classes less 
3). 1 From Table 143 we see that when n = 12 a value of 
X 2 as great as 21.026, or greater, would occur purely as a 
result of chance in 5 out of 100 random samples, if the true 
value of x 2 were zero. The value 17.53 is not excessively 
high. Observed discrepancies may be attributed to chance. 
The fit obtained with the normal curve is acceptable. 
Our results are not inconsistent with the hypothesis that 
the observations in question follow the normal law of 
error. 

1 This derivation of degrees of freedom, which differs from that originally 
employed by Pearson and exemplified in the first printing of this edition, gives 

a more accurate and logical test of significance than does the older procedure 
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In applying the Chi-square te3t it is not necessary to 
determine the exact probability corresponding to a stated 
value of x 2 * Our purpose, in general, is to ascertain whether 
observed results are or are not consistent with the hypothesis 
on which the fitting procedure is based. For this purpose 
we wish only to know whether the value of P corresponding 
to a given value of x 2 falls below (or, much more rarely, 
above) certain critical values. As a conventional limit .05 
is usually employed. If a value of x 2 is such that P is below 
.05, the discrepancies between observed and theoretical 
values are, on this standard, considered too great to be 
attributed to chance. The hypothesis on the basis of which 
the theoretical frequencies have been determined is suspect, 
in such a case. If x 2 is large enough to give values of P 
below .02 or .01, the inadequacy of the hypothesis is, 
of course, more strongly indicated. 

R. A. Fisher points out that suspicion should attach 
to very low values of x 2 , which give values of P of .99 
or thereabout. These values indicate a very close agreement 
between the hypothesis and the observed facts. Such close 
agreement may be due to chance, but there is strong proba- 
bility that the hypothesis is at fault or, in mathematical 
terms, that the wrong function is being used. Coincidence 
of observed and theoretical values suggests the kind of 
agreement one obtains by fitting to n points a curve in 
the equation to which there are n constants. Any artificial 
forcing of agreement between hypothesis and observation 
of course invalidates the application of the Chi-square test. 

In applying the Chi-square test it is convenient to use 
the conventional standards we have noted, as guides to 
the rejection or provisional acceptance of working hypothe- 
ses. It is unwise to use these standards arbitrarily, however. 
No single standard possesses significance in any absolute 
sense. The investigator in a given field of research will 
interpret the information such a test yields in the light of 
other knowledge relating to that field of experience, and with 
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due regard to the rational foundation of the hypotheses 
being tested. 

One feature of Table 144 requires explanation. It will 
be noted that in the construction of this table the three 
classes at the lower end of the distribution have been lumped 
into one, and that the same thing has been done with the 
six classes at the upper end of the distribution (Cf. Tables 
109 and 144). This is done to avoid the undue magnification 
of slight differences between the tails of the observed and 
theoretical distributions. When /, the theoretical frequency, 
is very small, a relatively slight absolute discrepancy between 
/o and / may serve to swell materially the value of x 2 - The 
lumping process is designed to prevent such a distortion. 
Since the selection of classes for combination rests on the 
personal judgment of the investigator, a subjective element is 
necessarily introduced here. However, the results of the 
test will not usually be much affected by minor variations 
in the combination of tail-end classes. 1 

The use of x 2 in. testing the fit of theoretical frequency 
curves is subject to another rather important limitation. 
In the computation of x 2 no account is taken of the distribu- 
tion of discrepancies between f 0 and /. Yet the manner in 
which these discrepancies are distributed may materially 
influence our judgment as to the goodness of a given fit. 
In such an example as that given in Table 144, the successive 
values of fo — f, counting from the lower limit of the x-scale, 
might be alternately positive and negative. Something 
approaching this alternation would be expected if chance 
factors alone accounted for the differences between observed 
and theoretical frequencies. But the differences might be 
distributed otherwise. All the values of /o - / below the 

1 Considerations of the same sort suggest that a sample of reasonable size is 
needed for the valid application of the Chi-square test in curve fitting. Deming 
and Birge set 500 observations as the minimum required in a test of this type, 
if confidence is to be placed in the result. Yule and Kendall suggest a smaller 
number, but place emphasis on the need of an adequate number of theoretical 
observations (preferably not less than 10) in every class. 
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mode mi gh t be positive, while all the values above the 
mode might be negative. The cumulated discrepancies, 
as measured by x 2 , might be equal in the two cases, yet 
far more confidence would attach to a fit marked by alterna- 
tions of plus and minus deviations than to one in which 
a series of positive deviations were bunched together on 
the scale, and negative discrepancies were correspondingly 
clustered. This limitation serves as a warning against 
purely mechanical use of the x 2 test. Examination of the 
fit, and interpretion of x 2 m the light of the actual distribu- 
tion of discrepancies, are required in the application of 
this test. 

THE CHI-SQUARE TEST OP INDEPENDENCE OF PRINCIPLES OF 

CLASSIFICATION 1 

A question that frequently arises in research has to do 
with the relation between two principles of classification. 
Thus, in studying commodity price movements during 
revivals after business depressions, we may divide all com- 
modities into durable and non-durable classes. We may 
again divide them into those the prices of which precede 
the general average of commodity prices in the revival, 
and those that lag behind the general index. If the quality 
of durability has no relation to the timing of price recovery, 
the two principles of classification are independent. How- 
ever, certain considerations relating to the character of 
demand for durable and non-durable goods lead us to 
believe that the durability of a good is related to the 
behavior of its market price during a period of business 
revival. It is possible to apply an objective test to determine 
whether these principles of classification are, in fact, related. 

Observed frequencies are recorded in Table 145. 2 

1 For a discussion of tests of independence and homogeneity see Chapter IV 
of Statistical Methods for Research Workers , by R. A. Fisher. 

* Data from The Behavior of Prices , National Bureau of Economic Research, 
New York, 1927, with later additions. 
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Table 145 


Observation 

Two-fold Classification of 208 Commodities 


Commodity group 


Durable goods 
Non-durable goods 
Total 


Observed frequencies 


Number 

Number 


'preceding 

lagging behind 

Total 

general index 

general index 

on price rise 

on price rise 


6 

61 

67 

50 

91 

141 

56 

152 

208 


The nature of the durability classification requires no 
explanation. The classification relating to the timing of 
price changes in business revival is based on the average 
behavior of each of the 208 commodities during 13 periods 
of business revival occurring between 1890 and 1936. The 
process of cross-classification gives four “cells” among 
which the 208 commodities are divided in the manner 
indicated in the table. 

With the observed frequencies that constitute the entries 
in these four cells we may compare a set of theoretical 
frequencies, derived from the hypothesis that the durability 
of economic goods has no relation to the timing of price 
advances after business depressions. These expected fre- 
quencies are computed readily from the sub-totals. The 
67 durable goods constitute 32.21 per cent of all the com- 
modities, while the 141 non-durable goods constitute 67 . 79 
per cent of the total. If durability has no relation to the 
timing of price advance, after depression, we should expect 
the 56 commodities that preceded the general index to 
be divided between durable and non-durable goods in this 
same proportion. That is, 32.21 per cent of the 56 com- 
modities, or 18.04, should be durable, while 67.79 per 
cent of the 56, or 37.96, should be non-durable. Similarly, 
the 152 commodities lagging behind the general index in 
price revival should be divided between the durable and 
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non-durable categories in exactly the same way, 32.21 per 
cent in the durable class, 67 . 79 per cent in the non-durable. 
These expected frequencies, which conform to our hypothe- 
sis that the two principles of classification are independent, 
are given in Table 146. 


Table 146 

Expectation 

Two-fold Classification of 208 Commodities 
Commodity group Expected frequencies 



Number 

Number 



preceding 

lagging behind 

Total 


general index 

general index 


on price rise 

on price rise 


Durable goods 

18.04 

48.96 

67 

Non-durable goods 

37.96 

103.04 

141 

Total 

56.00 

152.00 

208 


Chi-square is computed from the relation x 2 =2 
in the following manner : 


own 


X 2 


(6 - 18. 04) 2 (50 - 37. 96) 2 (61 - 48. 96) 2 

18.04 + 37.96 + 48.96 

(91 - 103. 04) 2 
+ 103.04 


16.222. 


There are four components of Chi-square in this instance, 
but, as may readily be seen by reference to the table of 
expected frequencies, only one degree of freedom enters 
into its computation. The expected frequencies must yield 
the four group totals, 56, 152, 67, and 141. Accordingly, 
as soon as we fill one of the four cells set up by the process 
of cross-classification, the other three are definitely deter- 
mined. Given 18.04, the expected number of durable 
goods preceding the general index in price revival, the 
entries in the other cells are fixed. Subtraction of 18.04 
from 56 and 67 will fill two of them, and the filling of these 
determines the fourth. 
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For the interpretation of the given value of Chi-square 
we turn to Table 143, which is to be read with n, the number 
of degrees of freedom, equal to 1. If durability of economic 
goods has no relation to the timing of their price changes 
in revival, the two principles of classification employed 
above are independent and the true value of Chi-square 
is zero. Are the observed results consistent with this 
hypothesis? The entries in Table 143 indicate that if the 
true value were zero, a value as great as 3.841 would 
occur 5 times out of 100, as a result of chance fluctuations. 
A value as great as 6.635 would occur only 1 time out of 
100. The present value of Chi-square, 16.222, represents 
a still smaller probability. The results are not consistent 
with the hypothesis we have set up. The differences between 
the observed and expected frequencies are too great to be 
attributed to the play of chance. Durability, and factors 
of demand and supply related thereto, appear to play a 
definite role in the timing of price advances in business 
revivals. 

This test, it should be noted, does not define the relation- 
ship between durability of goods and the timing of price 
revival. It leads us to reject the hypothesis that durability 
has no bearing on the sequence of price advances in revival. 
If, on the basis of some other rational hypothesis, we could 
obtain a set of expected frequencies representing a definite 
relationship other than one of independence, this hypothesis 
could be tested in the same manner. From the present 
evidence, however, we may only conclude that the proportion 
of durable goods preceding the general price index on revival 
is smaller and the proportion of non-durable goods larger 
than would be expected if durability had no relation to 
the timing of price recovery after a business depression. 

THE CHI-SQUARE TEST OF HOMOGENEITY 

For each of eight major industrial groups we have records 
showing, for the year 1933, the number of corporations 
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reporting net incomes from their operations and the number 
reporting no net incomes (i.e., suffering deficits). The 
returns relate to a total of 492,649 corporations. Is this 
total a homogeneous whole, or does the division of corpora- 
tions between those earning net incomes and those suffering 
deficits vary significantly from group to group? The records 
appear in Table 147. 


Table 147 


Comparison of Observed and Theoretical Frequencies 

(Tabulations based on corporate income tax returns for 1933, 
by major industrial groups) 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

(7) 

(8) 



Actual 








number 

Theoretical 


Actual 

Theoretical 



Total 

of 

( expected ) 


number of 

(expected) 


Group 

number 

returns 

number of 


returns 

number 



of 

showing 

returns 


showing 

showing 



returns 

no net 

showing no 


net 

net 




income 

net income 


income 

income 




(fo) 

(/) 

/o-/ 

/o' 

if') 

/o' ~ f 

Agriculture and 








related indus- 








tries 

10,490 

7,818 

7,150 

+ 668 

2,672 

3,340 

- 668 

Mining and 
quarrying 

17,147 

8,866 

11,688 

- 2,822 

8,281 

5,459 

+ 2,822 

Manufacturing 

93,833 

62,295 

63,958 

- 1,663 

31,538 

29,875 

+ 1,663 

Construction 

18,234 

14,112 

12,428 

+ 1,684 

4,122 

5,806 

- 1,684 

Transportation 








and other pub- 
lic utilities 

24,302 

14,349 

16,564 

- 2,215 

9,953 

7,738 

+ 2,215 

Trade 

137,858 

93,621 

93,965 

- 344 

44,237 

43,893 

+ 344 

Service 

47,843 

35,419 

32,610 

+ 2,809 

12,424 

15,233 

- 2,809 

Finance 

142,942 

99,314 

97,431 

+ 1,883 

43,628 

45,511 

- 1,883 

Total 

492,649 

335,794 

335,794 


156,855 

156,855 


Per cent 

100.000 

68.161 



31.839 




The observed frequencies are, of course, the actual returns 
given in cols. (3) and (6) of Table 147. Sets of theoretical 
or expected frequencies may be set up, for comparison with 
these, on the assumption that all corporations in the United 
States constituted a homogeneous population as regards the 
likelihood of suffering a deficit or earning a net income in 
1933. On this assumption we may say that the probability 
of failing to earn a net profit was, for all the elements of this 

335 794 

assumed homogeneous population, 492^49 or - 68161 - Con " 

156,855 

versely, the probability of earning a net income was 49^549 
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or .31839. If these are the true probabilities for all elements 
of the population, we may determine theoretical frequencies 
for each industrial group by applying these ratios to the 
total number of corporations in that group. On the assump- 
tion made we should find, in all groups, the same propor- 
tionate division between corporations earning net incomes 
and those suffering deficits, except for modifications due to 
fluctuations of sampling. The expected frequencies appear 
in columns (4) and (7). If the hypothesis of homogeneity 
is valid, these are the true frequencies for the several groups. 
Differences between these and the observed frequencies re- 
flect the play of chance alone. 

The calculation of x 2 , measuring the degree of discrep- 
ancy between the observed and theoretical frequencies, is 
shown in the next table. It will be noted that there are 
two components of x 2 , one derived from the comparison of 
actual and theoretical frequencies among the corporations 
suffering deficits, the other from similar data for corpora- 
tions earning net incomes. The symbols f 0 and / are used 
with prime marks for the second group, to avoid ambiguity, 
but both elements enter into the final measurement desired. 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

Group 

ft ~f 

(/.-/)* 

(/.-/)* 

f 

ff ~f 

C fo'-f')* 
f 

Agriculture and 
related in- 

dustries 

+ 668 

446,224 

62.4090 

- 668 

133.6000 

Mining and 

quarrying 

- 2,822 

7,963,684 

681.3555 

+ 2,822 

1,458.8174 

Manufacturing 

- 1,663 

2,765,569 

43.2404 

+ 1,663 

92.5713 

Construction 

+ 1,684 

2,835,856 

228.1828 

- 1,684 

488.4354 

Transportation 
and other pub- 

lie utilities 

- 2,215 

4,906,225 

296.1980 

+ 2,215 

634.0430 

Trade 

- 344 

118,336 

1.2594 

+ 344 

2.6960 

Service 

+ 2,809 

7,890,481 

241.9650 

-2,809 

617.9860 

Finance 

Total 

+ 1,883 

3,545,689 

36.3918 

1,591.0019 

- 1,883 

77.9084 

3,406.0575 
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The required value of x 2 is equal to the sum of the entries 
in columns (4) and (6) of the above table. That is, x 2 equals 
1,591.0019 + 3,406.0575, or 4,997.0594. This is derived 
from 16 individual entries, 2 for each of the 8 classes. There 
are, however, only 7 degrees of freedom in its derivation. 
For as soon as 7 of the 16 cells for which frequencies are 
to be determined have been filled in, the other 9 are fixed. 
(The sum of the number of returns showing no net incomes 
and those showing net incomes must equal the actual num- 
ber of returns for each class, and the grand total of theoretical 
frequencies must equal total actual frequencies.) Since the 
1% value of x 2 , for n = 7, is only 18.475, it is clear that the 
discrepancy is too great to be attributed to chance. The 
results are not consistent with the hypothesis of homogeneity. 
We are not justified in assuming that the forces affecting 
the profitability of corporate operations in 1933 were the 
same, among the eight major industrial groups here repre- 
sented. 

The various procedures discussed in this chapter give 
some indication of the variety and power of the methods 
available for use in interpreting and appraising the results 
of statistical research. Each one involves, in some form, 
the testing of hypotheses against evidence yielded by the 
study of samples. It should be emphasized that the formal 
procedures described in the preceding pages are employed at 
a rather late stage in actual research work. The experiment 
will have been planned, the field work done, hypotheses 
framed before the tests here discussed can be applied. 
These various steps must, of course, be coordinated. The 
data must be gathered with reference to the hypotheses 
to be tested and to the analytical methods to be employed. 
Acquaintance with appropriate techniques is one pre- 
requisite of intelligent planning of research in which quanti- 
tative data are utilized. Familiarity with the characteristics 
and limitations of the available materials, and clear definition 
of the questions at issue, are equally important elements. 
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THE METHOD OF LEAST SQUARES AS APPLIED 
TO CERTAIN STATISTICAL PROBLEMS 

The method of least squares in the case of a single 
unknown quantity is merely a procedure for obtaining the 
most probable value of that quantity from a number of 
separate observations. The most probable value is that 
for which the sum of the squares of the deviations (or 
residuals) is a minimum. This is the arithmetic mean of 
the observations. 

Where the measurements or observations do not relate 
directly to a single unknown quantity, but to functions of a 
number of unknown quantities, the problem is somewhat 
different. In the first case mentioned each observation is 
in the form of a single magnitude. In the present case 
each observation is in the form of an observation equation 
in which the observed values of the variables, as found in 
combination, are entered. The unknown quantities are 
the constants which define the functional relationship 
between the variables in question. Our problem is that 
of finding the most probable values of these constants, the 
true values being unknown. 

As in the simpler case the most probable values are 
those for which the sum of the squares of the residuals 
is a minimum. In this case, however, the residuals are 
deviations, not from a single magnitude, as in the case 
of the arithmetic mean, but from the curve which describes 
the most probable functional relationship. The residuals 
are the differences between the computed and the actual 
values of the dependent variable. 
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DERIVATION OF THE NORMAL EQUATIONS 

Representing by Y an observed value of the dependent 
variable, by Y c the corresponding computed value, by v the 
residual, or difference between Y and Y c , and by W h W 3 , 
W», and Wi different independent variables (or different 
functions of a single independent variable), we may write 

Y c = f(W h W a , W 3 , Wi) 

V = Y e — Y 

= f(Wi,W t ,W 3 , Wi) - Y 
2(v°) = 2[f(Wi, W 2 , W 3 , Wi) - Y]\ 

If the function in a particular case is of the type 

Y e = aWi + bWi + cWi + dWi 

we have 

S(» s ) = 2[(aTF a + bW* + cW 3 + dWi) - Y]K 

Our problem is that of determining the most probable 
values of the constants that define the function. These 
constants are represented, in the present case, by a, b, c, 
and d. (The W’s, it should be noted, refer to quantities 
which are known, once the observation equations are given. 
In the usual case the JV’s are different functions of a single 
variable, but this is not essential.) On the assumption 
that the errors of observation are distributed in accordance 
with the normal law of error, it may be demonstrated 
that the most probable values of a, b, c, and d, in the above 
equation, are those which render 2(t> 2 ) a minimum; i.e., 

2[(aTFi + bWi + cW 3 + dW ^ — Y] 2 = a minimum. (a) 

The normal equations necessary for the solution may be 
obtained by equating to zero the partial derivatives of 
the above expression with respect to the unknowns, a, b, 
c, and d. That is, we first differentiate the above function 
with respect to o, holding b, c, and d constant, then with 
respect to 6, holding a, c, and d constant, then with respect 
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to c, holding a, b, and d constant, then with respect to d, 
holding a, b, and c constant. Carrying through this operation 
with respect to a, we have 

-i-A(aW 1 + bW 2 + cW 3 + dW 3 ) - Y]* = 0 

Od 

or 

I STF^CaFi + bW 2 + cW a + dW a) - Y] = 0. 
Differentiating equation (a) now with respect to b, we have 

4-2[(aTF x + bW 2 + cW 3 + dJF«) — F] 2 = 0 

OO 

or 

II 'ZWtHaWx + bW 2 + cWz + dWa) - Y] = 0. 
Differentiating equation (a) with respect to c, 

^-Z[( aJFi + bW 2 + cW 3 + dWt) - F] 2 = 0 
oc 

or 

III STF 3 [(aWi + bW 2 + cW 3 + dWJ - Y] = 0. 
Differentiating equation (a) with respect to d, 

^SKalF, + bW 2 + cW 3 + dW a) - F] 2 = 0 


or 

IV 2PF 4 [(aIF 1 + bW 2 + cW 3 + dW a) - Y] = 0. 

The most probable values of the quantities o, b, c, and 
d are secured by solving simultaneously the four normal 
equations thus obtained (numbered above I, II, III, IV). 

FORMATION OF THE NORMAL EQUATIONS 

When the observation equations are all of the first degree 
(i.e., of the first degree with respect to the unknown quan- 
tities, a, 6, c, etc.) the normal equations may be secured 
by the following process : 
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1. Write the equation which describes the assumed relationship. 
The observation equations are derived by substituting in this 
equation the observed values of the variables, as found in com- 
bination. 

2. Multiply each observation equation by the coefficient of the 
first unknown in that equation ; the sum of the resulting equations 
constitutes the first normal equation. 

3. Multiply each observation equation by the coefficient of the 
second unknown in that equation; the sum of the resulting equa- 
tions constitutes the second normal equation. 

Continue this process until normal equations equal in 
number to the unknown quantities are obtained. 

The actual process of forming the normal equations in 
curve fitting may be simplified, and the writing out of the 
separate observation equations avoided, as was demonstrated 
in earlier sections. The following may be laid down as 
general rules for the formation of the desired normal 
equations : 

1. Write the equation of the curve to be fitted. For the purpose 
of this explanation we may employ the general form 

Y = aWi + bW 2 + cW 3 + dW* + . . . (1) 

where Y represents the dependent variable, a, b, c, d, . . . repre- 
sent the constants in the equation (the unknown quantities in the 
present instance) and W lf W 2 , W 3 , W 4 , . . . represent the coeffi- 
cients of these unknowns. It is assumed that these coefficients 
represent variables, and that term is used with reference to them. 
Call this equation (1). 

2. Multiply each term in equation (1) by the coefficient of the 
first unknown in (1) (i.e., by W\) and place the summation sign, 
2, before each variable. This is the first normal equation (I). 

3. Multiply each term in equation (1) by the coefficient of the 
second unknown (i.e., by W 2 ) and place the summation sign before 
each variable. This is the second normal equation (II). 

4. Multiply each term in equation (1) by the coefficient of the 
third unknown (i.e., by W 3 ) and place the summation sign before 
each variable. This is the third normal equation (III). 

5. Multiply each term in equation (1) by the coefficient of the 
fourth unknown (i.e., by W 4 ) and place the summation sign before 
each variable. This is the fourth normal equation (IV). 
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The process may be continued until normal equations 
equal in number to the unknown quantities are obtained. 1 

A STANDARD SET OF NORMAL EQUATIONS 

As a set of generalized normal equations secured by the 
above process and applying to any equation which can be 
put in the form 

Y = aWi + bWi + cWi + dWi + . . . , 

we have 

i 2(^7) 

= a2(WV) + VZiWiWi) + cX(WiW 3 ) + d2(TW 4 ) + . . . 
II 2(TF 2 7) 

= aS(WiW 2 ) + 62(WV) + c 2(HW 3 ) + d2(TT 2 TF 4 ) + . . . 

III 2 (W 3 Y) 

= dZ(WiWJ + b2(W 2 W 3 ) + c2(WY) + d2{WsW<) + . . . 

IV 2(17*7) 

= dL(WiW^ + bZ(W 2 W<) + cL(W 3 W<) + <ffi(T7 4 2 ) + . . . 

By substituting for W 3) Wt, W», W t) etc., the particular 
functions employed in a given case, these equations may 
be readily adapted to any type of curve in the fitting of 
which the method of least squares is applicable. Thus in 
fitting a curve represented by the equation 

Y = a + bX + cX 2 + dX* 

substitutions in the standard normal equations given above 
are based upon the following relations : 

Wi = 1 

W* = X 
W 3 = X i 
Wt = x\ 

The changes to be made in the normal equations are 
obvious. 2(PFi 7) becomes 2(7); 2(J7i 2 ) is equivalent to 
2(1 J ), which is equal to N, the total number of observations. 

‘These rules represent an adaptation of a similar series formulated by 
Raymond Pearl in Medical Biometry and Statistics, 341. 
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The first normal equation becomes 

2(7) = Na + 62(X) + c2(X*) + d2(X*). 

The other normal equations are modified correspondingly. 

In the example just given, the coefficients are all different 
functions of a single independent variable, X. It is not, of 
course, essential to the method of least squares that this 
be so. The coefficients, W u W 2 , W$, etc., may represent a 
number of independent variables, as in the case of multiple 
correlation. 

The limitations to the method of least squares must be 
borne in mind in making use of it. This method, in its 
direct application, is limited to cases in which the equation 
to the curve to be fitted is linear in the constants, i.e., the 
observation equations must all be linear as regards the 
unknown values, a, b, c, etc. (This does not mean, of course, 
that the equation to the fitted curve must be linear.) As 
an example of this limitation, we may cite a curve having 
as equation y = ab cX , which cannot be fitted directly by 
the method of least squares. If the observation equations 
are non-linear they may be reduced to the linear form in 
many instances by the use of logarithms, and the method 
of least squares then employed. 

DERIVATION OF THE FORMULA FOR THE STANDARD ERROR 

OF ESTIMATE 

It has been pointed out in the body of the text that the 
standard error of estimate may be derived as a by-product 
of the method of least squares. A more complete demon- 
stration of this process may be given at this point. 

When the partial derivative of the expression 

2[(aWi + bWz + cWi + dW t ) — Y] 2 = a minimum 

is equated to zero, with respect to the first unknown, a, 
we have 


ZWJKaWi + bWi + cWi + dW<) - Y] = 0. 
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Since 

aWi + 6TF2 + cWi + dW i — Y — v, 
we have as a necessary condition of fitting 

S(t>Wx) = 0. 

When the partial derivative of the same expression with 
respect to b is equated to zero, we have 

STF 2 [(aWi + bWi + cW 3 + dW t ) - Y) = 0 
or, making the same substitution as in the preceding case, 

2 (vW 2 ) = 0. 

Repeating the operation with respect to c and d, we may 
show that 

2(t>F 3 ) = 0 

and 

2(t>TF 4 ) = 0. 

In summary: When the method of least squares is 
employed in determining the most probable values of cer- 
tain unknown quantities, having as known coefficients the 
quantities W h W 2 , W 3 , Wt, the following relations hold 
as a necessary condition of the least squares method: 

2(»lFi) = 0 
2 Wt) = 0 
2(^3) = 0 

2 {vWi) = 0. 

A knowledge of these relationships gives us a method of 
securing readily the value 2(v 2 ) and the standard error of 
estimate. Assume that, by the method of least squares, we 
have determined the constants in an equation of the type 

Y e = aWr + bW 2 + cW 3 + dWi. 

For each residual we have the relation 


v = aWx + bW 2 + cW 3 + dW 4 - Y. 


( 1 ) 
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Multiplying throughout by v, and summing, we have 
2(i> 2 ) = a 2(t>TFi) + bZ(vW 2 ) + eS(vW t ) + d2(vW t ) - 2(Ft>). (2) 

But 

2(«Fi) = 0 
2(t»TFs) = 0 
2(t>TF 3 ) = 0 
2(vIF 4 ) = 0 

therefore, 

2(« 2 ) = - 2 (Yv). (3) 

Multiplying each equation (1) throughout by F, and 
adding, we have 


2(Fr) = dZiWxY) + 62(TF 2 7) + c2(TF 3 F) + rf2(F 4 F) 

— 2(F 2 ). (4) 

Substituting in (3) the equivalent of 2(Fi>), we have 

2(t> 2 ) = 2(F) 2 - aZ(WiY) - 62(TF 2 F) - c2(TF 3 F) 

-dZ(WiY). (5) 

This gives us a method of obtaining the value 2(i> 2 ) 
without computing the separate residuals, a method that 
is applicable whenever the equation of the curve to be 
fitted is of the form, or may be reduced by the use of loga- 
rithms, reciprocals, or other manipulation to the form 


Y = aW 1 + bW 2 + cWz + dWz. 


In applying this to a particular case it is necessary only to 
replace W h W 2 , W 3 , W 4 , etc., by the functions that actually 
appear as coefficients of the unknown quantities in the 
original equation. Thus in fitting a curve the equation to 
which is 

F = a + bX + cX* + dX\ 


we find, as noted above, that 


Wi = 1 

W 2 = X 
W 3 = X 2 
W t = X s . 
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Making these substitutions in equation (5) above, we have 
2(u*) = 2(7*) - o2(7) - 62(Xy) - c2(X*y) - d2(X»y). (6) 

The standard error, S y , is derived from the equation 


where d is used to represent a deviation from a fitted curve. 
The deviation, d, then, is but another term for the resid- 
ual v. Accordingly, as a general expression for the standard 
error of Y, with W\, W 2 , W», and TF 4 as independent 
variables, we have 

os _2P- a2(T7 x y) - b2(W 2 Y) - c 2 (JF 3 y) - d2(W<Y) 

As in the previous case, this may be applied to a particular 
problem by replacing W u W 2 , W 3 , W A , etc., by the actual 
coefficients of the unknown quantities. 


DERIVATION OP THE FORMULA FOR THE INDEX OF 
CORRELATION 


We have adopted as an index of the degree of correlation 
between two variables the measure p (rho), derived from 
the equation 

pV = 1 - (8) 

<v 

assuming a single dependent variable, Y, and a single inde- 
pendent variable, X. With a single dependent variable, Y, 
and a number of independent variables, W h W 2 , W 3 , W i} 
the expression might be written 

pVwiwtuiaut = 1 ~T 2 (^) 


* Since our object is to measure the actual “scatter” about the fitted curve, 

Z(d«) 2(d») 

the formula is used, rather than the formula ^ (where N repre- 


sents the number of observations and N e the number of constants in the equa- 
tion to the fitted curve). The second formula would be used, in accordance 
with the theory of least squares, if we were seeking to determine the mean 
square error of an observation or of an observational equation. 
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Corresponding changes would be made in the subscripts for 
other changes in the symbols employed. The expression 
above is equivalent to 


p 


2 

y»w\WiV>iWi 


2 ( 2 /*) 


where y represents a deviation from an origin at the mean 
of the F’s. But 

2(2/*) 2(7*) # 

N ~ N v 


where F represents the original values of the 7-variable 
and c v represents the difference between the original origin 
and the mean of the F’s. (The symbols c„, and c, should 
not be confused with c, one of the constants in the equation 
to the fitted curve.) 

Accordingly, we have 


P 


2 

V'V)\WiWiWi 


2(d*) 

2(7*) - Nc v *‘ 


( 10 ) 


But we have secured an expression for 2(t> 2 ) [the equivalent 
of 2(d*)] which holds in the case of a curve fitted by the 
method of least squares. Substituting the equivalent of 
2(d 2 ) in the above equation, and simplifying, we have, 
as a general formula for the index of correlation 

P*y. Wl te 3 tir,W4 ...” (tt) 

dL(WiY) + 62(T7 2 7) + c2(T7 3 7) + dZ(J7 4 7) + . . . - NcJ 

2(7*) - Nc y * 


This may be applied to a specific case by replacing Wi, 
W 2) W a, Wt, etc., in the above formula by the functions 
which appear as coefficients of the unknown quantities in 
the original equation. When all these are functions of a 
single independent variable, as in the usual case, the index 
of correlation would be represented by the symbol p vx . 


CERTAIN SPECIAL CASES 

In the case of multiple correlation, where the symbols 
Xi, Xt, X a , Xi, etc., are used to represent all the variables, 
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whether considered dependent or independent, the symbol 
R is employed for the measure of correlation and numerical 
subscripts utilized as described in the body of the text. 

In the case of a straight line relationship between two 
variables, p is replaced by the symbol r, which represents 
the ordinary coefficient of correlation. As the general 
equation for r we have 

. aS(Y) + b2(XY) - Nc„ 2 
r ~ 2(Y 2 ) - Nc v 2 

There are two special cases in which this formula may be 
simplified. If the origin be at the mean of the X’s, we 
have 

ZF 

a — C V Jf 


Nc 2 = aZF 

and the formula for r reduces to 

r , VS(xY) 

2(F 2 ) - Ncy 2 

If the origin be at the mean of the F’s (it is not essential 
that it be also at the mean of the X’s) 

2(y) = 0, and c u = 0 

and the formula for the coefficient of correlation becomes 

r 2 _ mm 

S(2/ 2 ) 

In this latter case the general formula for p may also be 
simplified by the elimination of the terms o2(y) and Nc v 2 . 

CHECKS ON THE FORMATION OF THE NORMAL EQUATIONS 

There are so many possibilities of arithmetical error in 
the formation and solution of a set of normal equations 
that checks should be employed wherever possible. A 
convenient check on the calculations leading to the normal 
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equations is afforded by the introduction in each observation 
equation of an additional term, s, equal to the sum of all 
the known quantities in that equation. Thus, in the following 
system of observation equations, formed in fitting a line 
to the points 1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 
9, 11, the values of s are as indicated: 

s 

3 = a + 16 5 

4 = o -f- 2 6 7 

6 = a + 36 10 

5 = a + 46 10 
10 = a + 56 16 

9 = a + 66 16 

10 = a + 76 18 
12 = a + 86 21 

11 = a + 96 21. 

(The coefficient of a in each case is 1, and this is added to 
the other known quantities.) 

In fitting a curve described by the type equation 

Y = aWi + bW 2 + cW 3 + dWi 

the following relations prevail between s and the other 
quantities computed. For each observation equation, 

Y + W i + W 2 + W 3 + Wi = s. 

For the normal equations, 

2 (WiY) + 2(fiV) + 2(F,TF 2 ) + 2 (JFiF 3 ) + 2(WiTF 4 ) = 2(W 1 s) 
2(JF 2 F) + 2 (JFiTF 2 ) + 2(fiV) + 2 (W 2 W 3 ) + 2 (WW 4 ) = 2(TF 2 s) 
2 (W 3 Y) + 2(TF 1 TF 3 ) + 2(tfW 3 ) + 2(fiY) + 2(TF 3 TF 4 ) = 2(TF 3 s) 
2(IF 4 F) +2(^1^*) +2(F 2 1F 4 ) +2(1F 3 TF 4 ) + 2(TF 4 2 ) = 2(TF 4 s) 

This form is capable of application to any specific problem. 
In each case the s-equations are formed in precisely the 
same way as the corresponding normal equations. 

In applying these checks several additional columns are 
needed in the working tables, but the extra trouble is 
more than compensated by the opportunity to check the 
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work at each stage. The application is illustrated in the 
following working table, showing the calculations involved 
in fitting a second degree curve of the form 

Y = a + bX + cX 2 

to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 
8, 10; 9, 9. 

Table A 


Illustrating the Use of Checks on the Formation of Normal Equations 


Y 

X 

X 2 

XY 

X 2 7 

s 

Xs 

X s s 

2 

1 

1 

2 

2 

5 

5 

5 

6 

2 

4 

12 

24 

13 

26 

62 

7 

3 

9 

21 

63 

20 

60 

180 

8 

4 

16 

32 

128 

29 

116 

464 

10 

5 

25 

50 

250 

41 

205 

1,025 

11 

6 

36 

66 

396 

54 

324 

1,944 

11 

7 

49 

77 

539 

68 

476 

3,332 

10 

8 

64 

80 

640 

83 

664 

5,312 

9 

9 

81 

81 

729 

100 

900 

8,100 

74 

45 

285 

421 

2,771 

413 

2,776 

20,414 


(Columns for X* and X 4 are omitted, as the values 2 (X s ) and 2(X 4 ) may be 
derived from prepared tables.) 


Each of the values in the column headed s is secured 
from the corresponding observation equation. Thus, from 
the first observation equation 

2 — la — f— 1 b — lc, 

we have 5 as the value of s (2, plus the coefficients of the 
three constants). These values of s are secured readily 
from the table by adding the figures in the columns headed 
Y, X, and X 2 , plus 1, the coefficient of the constant term a. 

Adding the various columns, the arithmetic work is 
verified by the following checks: 

2(Y) + N + Z(X) + 2(X S ) = 2(a) 

74 + 9 + 45 + 285 - 413 

2(XF) + 2(X) + 2(X 2 ) + 2 (X s ) = 2(Xs) 
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421 + 45 + 285 + 2,025 = 2,776 
2(X 2 F) + Z(X 2 ) + 2(X») + 2(X 4 ) = 2(X 2 s) 

2,771 + 285 + 2,025 + 15,333 = 20,414. 

Further uses of a check of this kind are explained below, 
in discussing the solution of the normal equations. 

OTHER TESTS 

The possibility of checking the calculations in other ways 
has been suggested in the preceding sections. Thus, where 
the coefficients of the constants in the equation to the 
fitted curve are represented by W h W 2) W 3 , W t) we know 
that 

Sft-lFt) = 0 
2<W 2 ) = 0 
2(t>TF 3 ) = 0 
2(t>TF 4 ) = 0. 

If a curve of the type 

F = a + bX + cX 2 + dX* 

has been fitted, this means that 

2 (») = 0 
20X) = 0 
2(«>X 2 ) = 0 
2(vX 3 ) = 0. 

The accuracy of the work may be tested by checking these 
relations. 

Finally, we may test the accuracy of the work by com- 
puting the standard error of estimate in two different ways. 
We may compute the separate residuals by taking the 
difference between computed and actual values of the 
dependent variable, and from these values determine S. 
This may be compared with the results secured by applying 
the general formula for the standard error, as derived above. 
In the fitting of the second degree curve, the data of which 
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were used to illustrate the method of checking the normal 
equations, the equation derived was 

Y = - .92860 + 3.52316X - ' 267316X 2 . 

From the residuals separately computed, we have 

S y = .4941. 

From the formula 

„ , S (F 2 ) - aS(Y) - bZ(XY) - c2(X 2 F) . 

Sy = W ’ 


we have 


Sy = .4947. 

This constitutes a final check upon the accuracy of the 
calculations. 


SIMPLIFICATION OF NORMAL EQUATIONS IN A MULTIPLE 
CORRELATION PROBLEM 1 

In the discussion of multiple correlation procedure in 
Chapter XVI the normal equations as first derived in the 
form 

I 2(Xi) = Na + 6 i 2 . 3 42 (X 2 ) + &i 3 . 242 (X 3 ) + 5i 4 . 23 2 (X ,j) 

II 2(XiX 2 ) = «2(X 2 ) + 6i 2 . 34 2(X 2 2 ) + 6 13 . 2 42(X 2 X 3 ) 

+ fe 14 . 2 32(X 2 X4) 

III 2(XiZ«) = «2(X 3 ) + b >2 . 3i X(X 2 X 3 ) + bu. 2 4 2(X 3 2 ) 

+ &h. 23 2(X 3 X 4 ) 

IV 2(XiX 4 ) = o2(X 4 ) + 6 12 .342(X 2 X 4 ) + 6«. 2 42(X 3 X 4 ) 

+ 6 14 . 23 2(X 4 2 ) 

were reduced in number and modified to facilitate their 
solution. Details of the method are here given. 

Letting A h A 2 , A 3 , and A 4 represent the arithmetic 
means of the several variables, and x lt x 2 , x 3 , and x 4 represent 
deviations from the means, we may replace the variables 

1 Adapted from H. R. Tolley and M. J. B. Ezekiel, “A Method of Handling 
Multiple Correlation Problem 3 , ” Journal of the American Statistical Associa- 
tion , Vol. 18, 993-1003. 
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X h X 2 , X z , and X 4 by their equivalents x x + A lf x 2 + A 2 , 
Xz + As, x 4 + A 4 . The normal equations now become: 

I S(xi + A \) = Na + Sfe + A2) • 612.34 + 2(#3 4 - A3) • 613.24 
+ S(x 4 + A 4) • 614.23 

II 2[(#i + A 1) (x 2 + A 2 )] — 2 [fe + A2) • a + 2(^2 + A2) 2 ] • 612.84 
+ 2[(£2 + A 2 )(Xz + A3)] • 613.24 
+ 2(#2 + A 2 )(x 4 + A 4) • 614.23 

III 2 [(xi + 4 i)(x 3 + - 43 )] = 2(2:3 + 4.3) ■ a 

+ 2[(X3 + 4 . 3) (2:2 + 42)] ' 612.34 + 2(2:3 + 43)* • & 13.24 
+ 2[(x3 -f- 4 s)(x 4 -f- 44)] • 614.23 

IV 2 [(xi + 4 i)(x 4 -f- 4 4)] = 2 (x 4 + 44) - a 

+ 2[(x 4 -+- 4 4) (x 2 + 42)] • 612.34 

+ 2[(x 4 4 4) (2:3 + 43)] • 613.24 + 2(^4 + 44) 2 • 614.23. 

Since 2 ( 2:1 + 4 X ) = 2x t + NA X , and since 22:1 = 0, 2(xi + 4 a ) 
and all similar expressions may be replaced by NA X , NA%, etc. 

If we expand 2(x 2 d- 4 2 ) 2 to 2(x 2 2 -f 24 2 x 2 + 4 2 2 ), the 
middle term drops out, because 2x 2 = 0, and the expression 
may be written 2x 2 2 + NA 2 2 . The sums of all similar 
squares may be put in similar form. 

The product sum 2(xi + 4 i)(x 2 + 4 2 ) = S(xiX 2 + 4 iX 2 
+ 4 2 Xi + 4 ^ 2 ) = 2xiX 2 d- NAiA 2 since 2xi = 0 and 2x 2 
= 0. Product sums of the same type may be similarly 
modified. The normal equations now take the form: 

I iV 4 i = Na d - N A 2612.34 + iV 4 3613.24 d - N A 4614.23 
II 2 (xix 2 ) d - iV 4 i 4 2 = N A 2 a -f- [2(x2) 2 d - A/4 2 2 ]6i2.34 

d* [S(x 2 x 3 ) d* N 4 2 43]6i 3. 2 4 d~ [2(x 2 X4) d* A4.24 4)614.23 

III 2(11X3) jV 4 i 4 3 = 44 ?fl d- [ 2 (x 2 X 3 ) d~ NA 2 A 3 ] 6 i 2.34 

d- [ 2 (x 3 ) 2 + A r 4 ,'j 2 ] 6 i 3.24 d~ [ 2 (x 3 X 4 ) d - A r 4 34 4)614.23 

IV 2 (xix 4 ) d- NAiAi = NA,a d- [ 2 (x 2 x 4 ) d- NA 2 A 4 ]b 12 . 3i 

d- [2(xjx 4 ) -+• A r 4 3 4 4 ] 6 i 3.24 d~ [2(x 4 ) 2 -f- A r 4 4 2 ]6 1 4. 2 3. 

If we now divide through by N, and substitute p w for 
a 2 for \ and similar symbols for other mean 
products and mean squares, the normal equations become 
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I Ai = a -f* A2&12.34 4" A 3613.24 + A 4614.2s 
II Pit 4* A1A2 = A2O 4" ( Ot 2 4* A2 2 )6l2.34 + (?23 4" A2As)6l3.24 

+ (?*24 4* A2A 4 ) 614.23 

III Pit + A1A3 = A30 + (P23 + A 2^3)612. 34 + (ff 3 2 + As 2 ) 613.24 

4 * (ps 4 4 " A 3 A 4)614.23 

IV pu 4* A1A4 = A40 4- (P24 4* A2A 4) 612.34 4- (?34 4- A sA 4)613.24 

4“ (<T4 2 4* A4 2 )6i4.23- 

These four simultaneous equations may now be reduced 
to three. We multiply equation I, throughout, by At, 
and subtract the result from equation II; we then multiply 
equation I by A 3 , and subtract the result from equation III; 
we then multiply equation I by A 4) and subtract the result 
from equation IV. All the terms containing A’s are thus 
eliminated and we obtain the three normal equations 

P12 = 02 2 6l2.34 4- P23613.24 4" P24614.23 
Pl8 — P23612.34 4- 03 2 6l3.24 4" P34614.23 
P14 — P24612.34 4- PS4&13.24 4" 04 2 6l4.23. 

Inserting the observed values of the p’s and the <r’s, these 
are solved for the coefficients b. The value a may then be 
obtained by inserting the values of the A’s and the 6’s 
in the equation 

Ai = a 4- A2612.34 4* A3613.24 + A 461 4. 23- 
solution OF THE NORMAL EQUATIONS 

The task of solving the normal equations is not a difficult 
one in most of the cases presented to the economic statis- 
tician. If there are only two or three unknowns the corre- 
sponding number of normal equations may be solved by 
simple algebraic methods. Even with three equations, 
however, it is advisable to employ a systematic procedure, 
and with more than three equations this is imperative. 
Such systematic methods of solving the simultaneous equa- 
tions which are met with in connection with the method 
of least squares have been worked out by Gauss and by 
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Doolittle. The latter method, which is perhaps the more 
convenient for general usage, is demonstrated below. 

The coefficients of the unknowns in the normal equations 
are always symmetrical with respect to the principal diago- 
nal. Thus in securing the most probable values of the 
constants in the equation 

Y = aW 1 + bWi + cW a + dW i} 
we have the four normal equations 

aSCflV) + VZ{WiWi) + cZ{W x W 3 ) + dZ{W x W x ) - Z{W X Y) = 0 
oLiWtWa) + 6Z(WV) + cZ(W 2 W } ) + dZ(W 2 W x ) - 2(1F S F) = 0 
dZ(W x W 3 ) + bZ(W 2 W 3 ) + cZ(WV) + dZ{W 3 W x ) - Z(W 3 Y) = 0 
aZ(WiWi) + 6Z(TF 2 TF 4 ) + cZ(W 3 W 4 ) + dZ(Wfi - 2(TF 4 F) = 0 

The symmetrical arrangement about the diagonal, when 
F-terms are neglected, is obvious. Starting with any term 
on the principal diagonal, we have the same coefficients 
directly above as to the left. Thus, above the diagonal 
term in which the coefficient 2(WV) appears, we have the 
coefficients Z(W 2 W 3 ) and Z(W X W 3 ). The same coefficients 
are found to the left of the given diagonal term, and on 
the same line. For the purposes of solution, therefore, 
the terms to the left of each diagonal entry may be omitted, 
and we may put the remaining terms of the normal equations 
in the form 

a2(WV) + bZ(W x W 2 ) + cZ(W x W 3 ) + dZ(W x W x ) - 2(TFiF) 

+ 62(TF 2 2 ) + cS(WW 3 ) + dZ{W 2 W t ) - 2 (W t Y) 

+ c2(WY) + dZ(W 3 W t ) - Z(TF»F) 

+ d2( WS) - 2(JF 4 F). 

THE DOOLITTLE METHOD 

The Doolittle method may be illustrated with reference 
to the following normal equations: 

8.3564a + 2.7906 + 2.932c + 47.967 = 0 
2.790a + 6.66456 + 2.063c + 62.039 = 0 
2.932a + 2.0636 + 7.7893c + 47.519 = 0. 
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Putting these, for the purposes of the solution, in the 
abbreviated form given above, we have 

8.3564o + 2.7906 + 2.932c +47.967 
+ 6.66455 + 2.063c + 62.039 
+ 7.7893c + 47.519. 

We wish to solve these for the constants a, b, and c. All 
the work of computation, with the necessary checks, is 
shown in the following table : 


Table B 

Solution of Normal Equations by the Doolittle Method 



(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

Line 

Reciprocals 

a 

6 

c 


5 

I 


8.3564 

2 790 

2 932 

47 967 

62 0454 

II 



6 6645 

2 063 

62 039 

73 5565 

III 




7 7893 

47 519 

60 3033 

1 


8 35640 

2 790 

2 932 

47 967 

62 0454 

2 

- .11906876 

- 1 00000 

- 333S76 

- 350869 

- 5 740151 

— 7.424896 check 

3 



6 6645 

2 063 

62 039 

73 5565 

4 



- 931514 

- 978924 

- 16 015030 

- 20 715470 

5 



5 732986 

1 084076 

46 023970 

52 841030 check 

6 

- .17442917 


- 1 000000 

- 189094 

- 8 027923 

- 9 217017 check 

7 




7 7893 

47 519 

60 3033 

8 




— 1 028748 

- 16 830133 

- 21 769807 

9 




— 204992 

- 8 702857 

- 9 991922 

10 




6 555560 

21 986010 

28 541571 check 

11 

- . 15254227 



- 1 000000 

- 3 353796 

— 4 353796 check 


Back Solution 

c b a 

— 3 3 53796 - 8 027923 - 5 740151 

— 3.353796 + 634183 +2 468592 

— 7.393740 ± 1 .176743 

— 2.094816 

a = - 2.094816 
b = - 7.393740 
c = - 3.353796 

Check: 

Equation I: 

8.3564a + 2.7906 + 2.932c = - 47.967. 
Substituting the given values, 

8 3564 (— 2.094816) + 2.790(- 7.393740) 

+ 2.932(— 3.353796) = - 47.966985. 
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Explanation . — The coefficients of the unknown quan- 
tities, a, b, and c, are listed in the designated columns. 
The known term in each normal equation is listed in col- 
umn (5). (The sign of this known term, it should be noted, 
is that which it would have when the entire expression, of 
which it is one term, is equated to zero.) Column s is 
employed as a check. The value in column s, in each of 
the lines I, II, and III, is the algebraic sum of the known 
values in the given normal equation. In securing this 
sum the coefficients to the left of the diagonal, which have 
been omitted from the table as it stands, must be included. 

The following is a summary of the procedure in solving 
the normal equations: 

1. In line (1) write normal equation I. 

2. In line (2), column (1), write the reciprocal of the value in 
line (1), column (2), with sign changed . (This is the reciprocal of the 
coefficient of a.) Multiply each item in line (1) by this reciprocal, 
entering the products in the corresponding columns in line (2). 
[The algebraic sum of the items in columns (2), (3), (4), and (5) of 
line (2) should equal the value in column (6).] This operation has 
eliminated the unknown a, by expressing it in terms of b and c. 
[The — 1 in line (2), column (2), has been included only to facili- 
tate the checking process. The same is true in lines (6) and (11)./ 
A heavy line may be drawn across the table below line (2). 

3. Write normal equation II in line (3). 

4. Multiply by the coefficient of b in line (2) (i.e., — .333876) 
the items in columns (3), (4), (5), and (6) in line (1). Enter the 
products in the corresponding columns of line (4). 

5. Add lines (3) and (4), entering the sums in line (5). [The 
algebraic sum of the items in columns (3), (4), and (5) of line 
(5) should equal the value in column (6).] 

6. In column (1), line (6), enter the reciprocal of the value in 
column (3), line (5), reversing the sign. Multiply each term in line 
(5) by this reciprocal, entering the products in line (6). [The sum 
of the items in columns (3), (4), and (5) of line (6) should equal the 
value in column (6).] This operation has eliminated the unknown 6, 
by expressing it in terms of c. A heavy line may be drawn across 
the table below line (6). 

7. Write normal equation III in line (7). 

8- Multiply by the coefficient of c in line (2) (i.e., — .350869) 
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the items in columns (4), (5), and (6) of line (1). .Enter the products 
in the corresponding columns of line (8). 

9. Multiply by the coefficient of c in line (0) (i.e., — . 189094) 
the items in columns (4), (5), and (6) of line (5). Enter the products 
in the corresponding columns of line (9). 

10. Add lines (7), (8), and (9), entering the sums in line (10). 
[The algebraic sum of the items in columns (4) and (5) of line 

(10) should equal the value in column (6).] 

11. In column (1), line (11), enter the reciprocal of the value in 
column (4) of line (10), reversing the sign. Multiply each term in 
line (10) by this reciprocal, entering the products in line (11). 
[The algebraic sum of the items in columns (4) and (5) of line 

(11) should equal the value in column (6).] This operation gives 
the value of c, which is found in column (5) of line (11). A heavy 
line may be drawn across the table below line (11). 

Were there additional unknowns, as d and e, this last 
operation would have given c as a function of d and e and 
it would be necessary to carry the process still further, 
repeating the steps taken above. The next operation would 
be to bring down the fourth normal equation, entering it 
in line (12). Then the coefficients of d in lines (2), (6), and 
(11) would be used to multiply the necessary items in 
lines (1), (5), and (10), the products being entered in lines 
(13), (14), and (15). The sum of the items in lines (12), 
(13), (14), and (15) would be entered in line (16) and 
checked by the item in the s column. Multiplying through 
by the reciprocal of the coefficient of d in line (16), with 
sign reversed, the value of d would be obtained in terms 
of e. The value of e would be derived in a similar fashion. 

The checks on these various operations have been indi- 
cated in the table. The testing of the results at each step 
reduces the possibility of error to a minimum. 

The back solution presents no difficulties. We have, 
from line (11), 

c = - 3.353796, 
from line (6) 

b = - .189094c - 8.027923, 
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from line (2) 

a = - .3338766 - .350869c - 5.740151. 

[The items in column (6) are inserted merely as checks. 
The items — 1 . 000000 which appear in lines (2), (6), and 
(11) are inserted to assist in the checking.] 

The computations involved in the back solution appear 
in the table. 

A final check is afforded by inserting the values secured 
by this process in one of the normal equations. This check, 
as carried out for equation I, is shown below the table. 
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APPENDIX B 


DERIVATION OF FORMULAS FOR MEAN AND 
STANDARD DEVIATION OF THE BINOMIAL 
DISTRIBUTION » 


For convenience we put the binomial in the form (g + p) n , 
where g = probability of a failure, p = probability of a 
success, and g + p = 1. Expanding the binomial, we have 


(? + p) n = 3" + nq n ~ l p l + 


n(n - 1) „ n _ 


+ 


1 -2 
n(n — 1 )(n 


gn—2p2 


1-2-3 


— q n ~ 3 p 3 -f 


+ P". 


The terms of this expansion indicate, in order, the probable 
frequencies of no successes, 1 success, 2 successes, 3 suc- 
cesses, and so on, to n successes. A frequency table of the 
familiar type may be constructed from these materials. 

The items in col. (2) of Table C constitute the terms of 
the binomial expansion. Their sum is thus equal to (g + p) B , 
which is, by definition, equal to 1. The items in col. (3), 
added in order, give 

-f- n(n — l)g" _ 2 p 2 + — —q n ~ 3 p 3 

. n(n — l)(n — 2 )(n — 3) ... . 

+ 1 . ' 2 .3 1 S' P + • • -+np n . 


Since the factors n and p appear in each of these terms, this 
reduces to 


»pjV -1 + (» - iKg” -2 ? 1 ) + — — — — 3 n_3 p 2 


+ 


(n — l)(n — 2 )(ra — 3) 
1-2-3 


qn-4pZ 



1 These derivations are adapted from the proof given by D. C. Jones in 
A First Course in Statistics . London, Bell & Sons, 1921, 143-145. 
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But the terms within brackets, following np, represent the 
expansion of the binomial (q + p) n_1 . Since q + p = 1, 
the sum of these terms is 1. Accordingly the sum of the 
items in col. (3) reduces to 

np(q + p) n ~ l = np. 

For the mean of this distribution we have 


= SCW = »p 

2(f) 1 p 

Adding the items in col. (4) in order, we have 


3n(n - 1)Q - 2) 


»~s 


1 -2 


q n ~°p 


nqn-ipi -j- 2 n(n — 1 )g ,n_2 p 2 + 

4 n(n - l)(n - 2)(w - 3) 44 . 

+ P1T~3 9 P+ ‘ 




1 + 2 (» - l)?" -2 ? 1 + 


3(» ~ 1)(» ~ 2) , 


3-n2 


1 -2 


q n ~p 


+ nV 


+ — — 1 1^2 ~ g n-4 P 8 + • • • + n P n l J' 


The terms within brackets may be broken into two groups, 
giving 

np£ 1 9 B_1 + (« - l)?" -2 ? 1 + — — — — $ n-a P 2 
+ ( n- 1) (n- ^ (n- _ 3) r _ 4p 3 + + pn -, } 

+ { (n - l)g"-y + 2 ^ n — pry- — 2 -g"~ 3 P 2 
+ 3(rt - l)(n - 2)(n - 3) g „. 4p3 + + (n _ i) p »-i } ]. 

The terms within the first of these two groups constitute 
the expansion of the binomial ( q + p)” -1 . These terms 
may be replaced by that binomial; the second group of 
terms may be simplified, since they contain the common 
factors n — 1 and p. These operations give us 
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+ (n — 1 )p | q n ~ 2 + (n — 2 )q n ~ 2 p l 
+ ( n - 2Kp3) gn _ v + 



The second group of terms, thus simplified, is seen to be 
(n — l)p multiplied by the expansion of the binomial 
(? + P) n ~ 2 ‘ Thus we have, as the sum of the items in 
col. (4) of the preceding table, 


np[(q + p) n ~ l + (n - 1 )p(q + p)"~ 2 ]. 

But since q + p = 1, (q + p) n ~ 1 = 1 and (q + p)"~ 2 = 1. 
Accordingly, the total of col. (4) becomes 


np[ 1 + p{n - 1)]. 


As a general formula for the standard deviation, in 
squared form, we have 


N 


c 2 


where c is the difference between the mean of the distribu- 
tion and the arbitrary origin. In the present instance, 
the origin is at 0, or “no successes,” and c is equal to the 
mean, or np. N is equal to 2(/), or 1, in this case. Thus 
the standard deviation of the binomial distributions given by 

a 2 = np[ 1 + p(n — 1)] — n 2 p 2 
= np[np + (1 — p)) — n 2 p 2 
= n 2 p 2 + np(l — p) — n 2 p 2 
= np( 1 - p) 

= npq 
a — Vnpq. 
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DERIVATION OF THE STANDARD ERROR 
OF THE ARITHMETIC MEAN 

We have made n random, hence independent, observations 
on a given variable. The respective observations may be rep- 
resented by Xi, X 2 , X 3 . . . X„. Representing the sum of 
the n observations by W, we have 

W = X, + X 2 + X 3 + . . . + X n . (1) 

Additional samples are now taken until we have N values of 
Xi, N values of X 2 , etc., and hence N values of the sum, W. 
We have N samples, therefore, of n observations each. The 
mean values, which we may represent by barred letters, 
stand in the same relationship of equality: 

W = X 1 + X 2 + X 3 + . . . + X n . (2) 

Using small letters ( w , Xi, x 2 , etc.) to define deviations of 
the actual observations from these mean values, we may 
write, for any given sample, or series of observations, 

w = Xi + x 2 + Xi + . . . + x n . (3) 

Squaring the two sides of this equation, we have 

10 s = Xi* -J- X2* "t~ Xi 2 -f- . . . Xn 2 ~f“ 2 TiX 2 -f- 2XjXs d* . . . 

+ 2xi x„ + 2 x 2 X 3 + 2x 2 x» + . . . 

+ 2x 3 x„ + . . . . (4) 

Each term on the right-hand side of (3) will appear in squared 
form in (4), and there will also appear product terms of the 
form 2 xix 2 corresponding to all possible pairings of the terms 
on the right-hand side. 

The next step involves the summation of the equations 
of type (4), derived from the N samples, and division 
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throughout by N. Each product term, when thus summed 
and divided by N, will be of the form 

22z 1 X2 


This, with the modification introduced by the factor 2, re- 
sembles the familiar mean product, encountered in cor- 


relation procedure. This mean product, we have seen, has a 
value of zero when the variables x and y are uncorrelated. 
But, by hypothesis, the observations that have given us 
x h x 2) x 3 , etc., are independent of one another, and hence 
these variables are uncorrelated. Accordingly, each of the 
product terms, derived when N equations corresponding 
to (4) above are summed and divided by N, is equal to zero. 
The process of summation and division gives us, therefore, 


_ Ezi 2 Ita 2 . 2 z 3 2 - i 2x„ 2 

~N ~ ~N~ + IT + - ‘ ’ + ~N~ 


or 

O’w 2 — ffl 2 + 0"2 2 + 0"3 2 + • • • + On 2 - (6) 

If all the observations relate to the same universe (i.e., if 
the samples are all drawn from the same parent population), 
which is true, by hypothesis, the standard deviations appear- 
ing in the right-hand member of equation (6) are equal to 
one another and to the standard deviation of the population. 
Accordingly, using <r to represent that standard deviation, 
we have 


<X W 2 = TUT. (7) 

The next argument, that leads directly to the desired 
measurement, follows precisely these steps, which have been 
given in the above form to indicate the reasoning involved. 
It starts, however, with a variant form of equation (3). 
Dividing that equation throughout by n, we have 


W = Xi *2 l3 , Xn. 

n n'n n ’ n 


( 8 ) 
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Working with the variables — , — , — , etc., just as we have 

n n n 

done with w, x u x 2) etc., we may go through the operations 
represented by equations (4), (5), and (6), above. The 
product terms disappear, as in passing from (4) to (5). In 


w • • 

the process of squaring the term - is treated as an entity; 

7 X 

the sum of the squared values is thus 2 (^j • Numerator and 


denominator of each of the terms of type — are squared 

n 

separately, however, and the sum is of the form — j. Division 

7b 

throughout by N then gives the quantities appearing in equa- 
tion (9), which corresponds to equation (6). 


<Tw‘ 


<r i 2 . qy 8 , o~3 2 . 
n* + n 2 i " n 2 



(9) 


Since all observations relate to the same universe, this re- 
duces to 


From this 



( 10 ) 



n 



(ID 


But w is the sum of n observations drawn from a universe 


w • 

having a standard deviation of <r, and - is the mean of these 

n 

observations. <Tw is the standard deviation of a distribution 

n 

of arithmetic means, corresponding to the familiar symbol 
a M . This is the desired expression for the standard error 
of the arithmetic mean, appropriate for use when the <r 
of the population is known. Where a is estimated from the 
standard deviation of a sample, accuracy is increased by 
using Vn — 1 rather than Vn in the denominator of the 
right-hand member of (11). 
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ILLUSTRATING THE MEASUREMENT OF TREND 
BY A MODIFIED EXPONENTIAL CURVE, A GOM- 
PERTZ CURVE AND A LOGISTIC CURVE 

The discussion in Chapter VII of mathematical functions 
suitable for use in measuring the secular trends of time series 
dealt with types required in ordinary practice. We here 
discuss briefly three other types suited to the measurement 
of long-term movements in economic and business series. 

The Modified Exponential Curve 
An exponential curve, which plots as a straight line on 
ratio paper, is a suitable measure of trend for a series that 
is increasing or decreasing at a constant rate, that is, one 
that shows constancy of relative growth. The figures defin- 
ing the successive trend values of a series of this type con- 
stitute a geometric progression. The trends of certain eco- 
nomic series that depart from constancy of relative growth 
may be accurately defined by a simple modification of the 
exponential curve. This is the case when the observed 
values may be transformed, by the addition (or subtraction) 
of a constant magnitude, to a series closely approximating 
such a geometric progression. 

If we represent by K the constant magnitude that is to 
be added (algebraically) to each observed value in effecting 
the desired transformation, the task of fitting the trend line 
involves the following steps: 

Determination of K. 

Correction of observed values by K, to obtain the modified series. 
Fitting an exponential curve to the modified series, and computer 
tion of trend values of the modified series. 

667 
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Correction of trend values of the modified series by K to obtain 
trend values of original series. 

If y represents the ordinates of trend of the original series 
and x represents time, the equation to the desired line of 
trend may be put in the form 

y = ab z — K 

where K is the correction factor noted above and a and b 
are constants to be determined by fitting an exponential 
curve to the modified series. The procedure may be illus- 
trated with reference to the data in Table D. 


Table D 

Illustrating the Fitting of a Modified Exponential Curve 
Production of Rayon Filament Yarn in the United States, 
1920-1931 1 


(Data in thousands of pounds) 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

Year 

Original 

series 

(observed) 

Group 

mean 

Modified 

series 

Trend values , 
modified 
series 

Trend values , 
original 
series 




(2 ) + K 


(5) — K 

1920 

10,125 


27,669 

29,108 

11,564 

1921 

14,986 

Mi = 

32,530 

34,363 

16,819 

1922 

24,067 

21,034.25 

41,611 

40,565 

23,021 

30,344 

1923 

34,959 


52,503 

47,888 

1924 

36,328 


53,872 

56,532 

38,988 

1925 

51,049 

M 2 = 

68,593 

66,736 

49,192 

1926 

62,693 

56,406.25 

80,237 

78,782 

61,238 

1927 

75,555 


93,099 

93,003 

75,459 

1928 

97,232 


114,776 

109,790 

92,246 

1929 

121,399 

M z = 

138,943 

129,608 

112,064 

1930 

127,333 

150,879 

124,210.75 

144,877 

153,003 

135,459 

1931 


168,423 

180,621 

163,077 


In employing this method we approximate K empirically 
by breaking the observed series into three parts, represent- 
ing equal periods of time, and determining the mean of the 
1 Data from Textile Economics Bureau. 
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observations for each period. We may designate these means, 
in chronological order, by Mi, M 2 , and M 3 . The desired 
value, K, is given by 

K = [MS - C Mi X Ms)] - [(Mi + Ms) - 2M 2 ]. 

If the observed series constitute a geometric progression 
the value of K will be zero; if the addition of a constant 
magnitude to the members of the original series will yield 
a series approximating a geometric progression, K will be 
positive; if the subtraction of a constant amount from the 
observed values will yield a series approximating a geometric 
progression, K will be negative. (In practice, K is given 
the sign obtained by the employment of the method de- 
scribed above, and then added algebraically to the observed 
series.) 

In the present case we have 

K = [(56,406.25)* - (21,034.25 X 124,210.75)] 

-h [(21,034.25 + 124,210.75) - (2 X 56,406.25)] 

= + 17,544. 

Adding this amount to each of the values recorded in 
col. (2) of Table D, we obtain the modified series in col. (4). 
In fitting an exponential curve to the modified series, it is 
desirable to use logarithms, that is, to solve the constants 
in an equation of the type log y = log a + (log b)x. This 
procedure was explained in Chapter VII. For log a of this 
curve we obtain 4 . 824359, and for log b . 072068. (The origin 
is at 1925.) The antilogarithms of the series of trend values 
thus obtained are given in col. (5). These define the trend of 
the modified series. Subtracting K (algebraically) from these 
values we obtain the trend values of the original series, which 
appear in col. (6). 

The original series measuring production of rayon fila- 
ment yarn and the modified exponential curve fitted to this 
series are shown graphically in Fig. A. 

It is essential that the three M’s used in the determinar 
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Year 


Fio. A. — Total Production of Rayon Filament Yarn in the United States, 
1920-1931, with Modified Exponential Trend 

tion of K relate to equal numbers of observations and that 
the midpoints, in time, of the three periods be equidistant. 
In the above example the number of years included in the 
period is a multiple of three, and no difficulty arises. If the 
number of years included is not a multiple of three, intervals 
that overlap slightly may be employed. For example, if our 
series had run from 1920 to 1932, the three averages might 
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have been derived, respectively, from the five-year periods 
1920-1924, 1924r-1928, 1928-1932. These would center, re- 
spectively, at 1922, 1926, and 1930, and would thus be equi- 
distant in time from one another. Alternatively, if monthly 
data are available, division of the total period into three 
equal parts may be facilitated by using a time-unit of 4 or 
8 months, rather than 12 months. 

The Gompertz Curve 

The Gompertz curve, which has important uses in actu- 
arial science, has had some application in the study of eco- 
nomic and business trends. 1 The term “growth curve” is 
applicable to it, since it portrays a process of cumulative 
expansion to a maximum value. This expansion proceeds 
by decreasing absolute increments in the later stages, but 
continues to the end without retrogression. It may not be 
assumed that this form of growth is typical of all industrial 
development, but the curve has value as an empirical rep- 
resentation of certain trend movements. 

For the purpose of fitting, the equation to the curve is 
transformed from the natural form 

■y = ab c ‘ 

to the logarithmic form 

log y = log a + (log b)c x . 

When fitted to an appropriate set of observations, measur- 
ing the expansion of an industry or the growth of an eco- 
nomic element, log a is the logarithm of the maximum value 
— the ceiling that the curve approaches. The second term 
measures the amount by which the trend value at a given 
time falls short of this maximum, an amount that diminishes, 
of course, with the passage of time. (The series for which 
this curve is an appropriate measure of trend will be ex- 
panding by decreasing amounts in the later stages of the 
period covered, and c, derived in the manner indicated below, 
will have a value between zero and unity.) The origin on 

1 Cf. Raymond B. Prescott, “Law of Growth in Forecasting Demand . K 
Tournal of the American Statistical As8ociation f Dec., 1922. 



672 


MEASUREMENT OF TREND 


the x-scale (time) is taken at the year to which the first 
entry relates. 

The method employed in fitting this curve is an approxi- 
mative one, since the least squares procedure in customary 
form is not applicable. Here, as in the preceding example, 
the series is broken into three equal portions. The sum of 
the logarithms of the observations in each of these segments 
is obtained; from these sums, and the differences between 
them, the necessary constants may be computed. The 
method is illustrated with reference to the data of rayon pro- 
duction for the years 1920-1937, which appear in Table E. 


Table E 

Computation of Quantities Required in the Fitting of a 
Gompertz Curve 

Production of Rayon Filament Yam in the United States, 1920-1937 
(Data in thousands of pounds) 



Rayon 



First 

Year 

production 

y 

10,125 

Logy 

Sub-totals 

differences 

1920 

4.0053950 



1921 

14,986 

4.1756857 



1922 

24,067 

4.3814220 



1923 

34,959 

4.5435590 

S 1 = 


1924 

36,328 

4.5602415 

26.374290 


1925 

51,049 

4.7079872 



1926 

62,693 

4.7972191 


d 1 =S 2 - Sx 

1927 

75,555 

4.8782632 


= 3.656786 

1928 

97,232 

4.9878092 



1929 

121,399 

5.0842151 

Si = 


1930 

127,333 

5. 1049409 

30.031076 


1931 

150,879 

5.1786288 



1932 

134,670 

5.1292709 


d>2 t=t Sz S 2 

1933 

213,498 

5.3293938 


= 2.095138 

1934 

208,321 

5.3187331 



1935 

257,557 

5.4108734 

St — 


1936 

277,626 

5.4434601 

32.126214 


1937 

312,236 

5.4944829 

88.5315830 
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We may use n to define the number of terms entering 
into each of the three sub-totals (in the present example 
n = 6) ; the sub-totals are represented, in chronological or- 
der, by Si, S 2 , and S 3 ; the first differences between the sub- 
totals are represented by di and d 2 . 1 We use these quantities 
in solving for the three constants c, log b and log a. The 
general relations from which these values are determined 
are the following: 



Inserting the proper quantities, we have 


c » _ c s _ 


2.095138 

3.656786 


= .572945 


c = \ / . 572945 = .911351 


log 5 = 


3.656786 X - .088649 
(.572945 - l) 2 


1.777493 


log o, = 26.374290 

= 5.822848. 


3.656786 \ 
572945 - 1 / 


The required equation is, therefore, 

log y = 5.822848 - 1 . 777493 (.911351*) 

in which x relates to deviations from an origin at the position 
of the first term. 

Substituting in this trend equation the values of x given 
in Table F, logarithms of the trend values are obtained. The 
corresponding natural numbers define the course of the line 
of trend. The method of calculation is indicated in Table F. 

1 The condition, previously noted, that the series to which the curve is to 
be fitted be one that is expanding by decreasing increments in the later 
stages of the period covered, is met when d% is less than d%. 
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Table F 

Illustrating the Computation of Ordinates of Trend of a Gompertz 
Curve Fitted to Data of Rayon Production, 1920-1937 


(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

Year 

X 

c* 

(log b)<? 

logy 

(4) A- log a 

y 

Anti-log of (5) 
(in thousands 






of pounds) 

1920 

0 

1.000000 

- 1.777493 

4.045355 

11,101 

1921 

1 

0.911351 

- 1.619920 

4.202928 

15,956 

1922 

2 

0.830560 

- 1.476315 

4.346533 

22,209 

1923 

3 

0.756932 

- 1.345441 

4.477407 

30,020 

1924 

4 

0.689830 

- 1.226168 

4.596680 

39,508 

1925 

5 

0.628677 

- 1.117469 

4.705379 

50,743 

1926 

6 

0.572945 

- 1.018408 

4.804440 

63,744 

78,474 

1927 

7 

0.522154 

-0.928125 

4.894723 

1928 

8 

0.475865 

- 0.845847 

4.977001 

94,842 

1929 

9 

0.433681 

- 0.770865 

5.051983 

112,715 

1930 

10 

0.395235 

- 0.702527 

5.120321 

131,923 

1931 

11 

0.360198 

- 0.640249 

5.182599 

152,265 

1932 

12 

0.328267 

- 0.583492 

5.239356 

173,523 

1933 

13 

0.299166 

-0.531765 

5.291083 

195,471 

1934 

14 

0.272645 

- 0.484625 

5.338223 

217,883 

1935 

15 

0.248475 

- 0.441663 

5.381185 

240,539 

1936 

16 

0.226448 

- 0.402510 

5.420338 

263,231 

1937 

17 

0.206374 

-0.366830 

5.456018 

285,771 

1947 

27 

0.081566 

-0.144983 

5.677865 

476,283 

1957 

37 

0.032238 

-0.057303 

5.765545 

582,834 

1967 

47 

0.012741 

- 0.022647 

5.800201 

631,245 


The original data and the Gompertz curve fitted to them 
are shown graphically in Fig. B. 

The ceiling to this curve is set by the constant a, which 
has a value of approximately 665,000,000 pounds. This 
indicates that if the extrapolation of the trend of rayon 
production from 1920 to 1937, as measured by a Gompertz 
curve, accurately defines the future course of production, 
the maximum output to be expected is 665 million pounds 
per year. (It need hardly be pointed out that this extra- 
polation involves some doubtful assumptions, and that no 
mystic significance is to be attached to it.) The years to 
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which the present data relate were years of rapid expansion 
in the industry. The slackening of the rate of increase, 
which is to be expected in a mature industry, had not be- 
come marked by 1937. In order that the nature of the curve 
may be clear, extrapolated values for 1947, 1957, and 1967 
are given in the table, and the projection of the trend is 



1920 1925 1930 1935 1940 1945 1950 1955 1960 1965 


Fig. B. — Total Production of Rayon Filament Yam in the United States, 
1920-1937, with Gompertz Trend Line Extrapolated to 1967 

shown in Fig. B. After 1947, and still more conspicuously 
after 1957, the curve shows a notable dampening in the rate 
of expansion. We may not say that the industry will actually 
follow this course. In particular, the asymptote a may be 
expected to change, as conditions affecting the industry 
and the demand for its products vary in the future. Within 
the limits of the observations, however, the Gompertz curve 
serves as a satisfactory measure of trend. 

The Logistic Curve 

The logistic curve, sometimes termed the Pearl-Reed 
growth curve because of the extensive use made of it in 
population studies by Raymond Pearl and L. J. Reed, 
resembles somewhat the Gompertz curve discussed above. 
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Table G 


Computation of Quantities Required in the Fitting of a Logistic Curve 
to Data of Railroad Mileage Operated in the United States, 
by Five-Year Intervals, 1850-1935 


(1) 

(2) 

Miles of 

(3) 

(4) 

(5) 

Year 

railroad 

operated 

100,000,000 

y 

Sub-totals 

First 

differences 


y 




1850 

9,021 

11,085 



1855 

18,374 

5,442 



1860 

30,626 

3,265 



1865 

35,085 

2,850 

Si = 25,882 


1870 

52,922 

1,890 



1875 

74,096 

1,350 



1880 

93,262 

1,072 


d l = S% - Si 

1885 

128,320 

779 


= - 21,849 

1890 

156,404 

639 

St = 4,033 


1895 

177,746 

563 



1900 

192,556 

519 



1905 

216,974 

461 



1910 

240,831 

415 


d'l •= S3 — S 2 

1915 

257,569 

388 


= - 1,679 

1920 

259,941 

385 

S 3 = 2,354 


1925 

258,631 

387 



1930 

260,440 

384 



1935 

252,930 

395 




It represents a modified geometric progression, the growth 
of a series that tends to decrease as it approaches some 
specified limit. Like the Gompertz curve it may be used as 
an empirical approximation to the trends of certain economic 
series. Extrapolations are subject, of course, to the same 
uncertainties that attach to projections of other empirically 
derived trend lines. 

A form of this curve adapted to use as a measure of trend 
is defined by the equation 

- = a + be*. 

y 
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This, it will be noted, is the equation to a modified exponen- 
tial curve, except that the dependent variable is rather 

y 

than y. (The symbols here used for the constants differ 
somewhat from those employed in treating the modified 
exponential curve.) A method of fitting somewhat similar 
to those employed in the preceding examples may be em- 
ployed, with necessary modifications required by the use of 
reciprocals of y. The method may be discussed with refer- 
ence to the data of railroad mileage in Table G. Compu- 
tations are facilitated by multiplying the reciprocals of y by 
a suitable power of 10, as is done in col. (3) of this table. 

As in the two preceding illustrations, the observations are 
divided, chronologically, into three equal groups. Group 
sub-totals and the first differences between these sub-totals 
are computed. The symbol n is used for the number of terms 
in each of these sub-groups. The origin of the x-scale (time) 
is set at the date of the first observation. The time unit 
here employed is five years. 

The constants in the desired equation may be derived 
from the following relations. 



h _ di(c - 1) 
(c» - l) 2 





}• 


Substituting the given values, we have 


3^9 -+ 076846 
c - 076846 = .652034 


- 21,849(— .347966) 
( .076846 - l) 2 


+ 8,921 . 14 
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“ - si 26 ' 882 - (.076^ - 1) } - + 369 M - 

These results relate to initial observations which have been 
modified by the multiplication of - by 100,000,000. The 

y 

desired equation is, therefore, 

100 ’°P _ Q l M Q = 369.04 + 8,921. 14 (.652034*) 

y 

where x measures deviations in five-year units from an origin 
at 1850. 

Succeeding calculations are shown in Table H. 


Table H 

Computation of Ordinates of Trend of Logistic Curve Fitted to Data 


(1) 

(2) 

(3) 

Year 

X 

c* 

1850 

0 

1.000000 

1855 

1 

.652034 

1860 

2 

.425148 

1865 

3 

.277211 

1870 

4 

. 180751 

1875 

5 

. 117856 

1880 

6 

.076846 

1885 

7 

.050106 

1890 

8 

.032671 

1895 

9 

.021303 

1900 

10 

.013890 

1905 

11 

.009057 

1910 

12 

.005905 

1915 

13 

.003850 

1920 

14 

.002511 

1925 

15 

.001637 

1930 

16 

.001067 

1935 

17 

.000696 


of Railroad Mileage 


(4) 

(5) 

100,000,000 

bc x 

y 

(a + 5c*) 

8,921 

9,290 

5,817 

3,793 

6,186 

4,162 

2,473 

2,842 

1,613 

1,982 

1,051 

1,420 

686 

1,055 

447 

816 

291 

660 

190 

559 

124 

493 

81 

450 

53 

422 

34 

403 

22 

391 

15 

384 

10 

379 

6 

375 


( 6 ) 

y 

(100,000,000 x Jj) 

10,764 

16,166 

24,027 

35,186 

50,454 

70,423 

94,787 

122,549 

151,515 

178,891 

202,840 

222,222 

236,967 

248,139 

255,754 

260,417 

263,852 

266,667 
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The process of calculation is a straightforward one. The 
reciprocals of the entries in col. (5), multiplied by 100,000,- 
000, yield the desired trend values given in col. (6). These 
values, with the original series, are shown graphically in 
Fig. C. 

As in the case of the Gompertz curve, the logistic is suit- 
able for measuring the trend of a series that, in its later 



Year 

Fig. C. — Railroad Mileage Operated in the United States, by Five-Year 
Intervals, 1850-1935, with Logistic Trend 


stages, is growing by decreasing increments. The curve 
resembles an elongated S rising from a lower asymptote of 
zero to an upper limit indicated by the constant o. Since a in 
this case refers to an equation in which the dependent vari- 
able is 100,00Q ? 0 --, the actual asymptote is — -• 
y a 

From the given value of a, 369 . 04, we derive 270,973 miles 
as the upper limit of railroad mileage in the United States. 
As is clear from the table and chart, the actual values are 
close to this indicated limit. Barring the possibility of a 
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fundamental change in relevant conditions, the record and 
the curve fitted to it indicate that the era of railroad ex- 
pansion has ended. The extrapolation is, of course, subject 
to all the reservations that attach to the projection of other 
curves. There can be no doubt that, within the limits of 
the observations, the logistic curve gives an excellent rep- 
resentation of the actual history of railroad operation in the 
United States. 
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A FURTHER APPLICATION OF VARIANCE 
ANALYSIS 

The possibilities of Fisher’s method of variance analysis 
were far from exhausted by the several examples given in 
Chapter XV. We here supplement the treatment in that 
chapter by an additional example, illustrating further tests 
that may be made with a two-fold principle of classification. 

The observations on which this example is based consist 
of relative numbers, measuring the prices of 670 commodities 
in February, 1933, with average prices in 1926 taken as 100. 
February, 1933 marked the low point of the severe price 
decline that began in 1929. The questions to which our 
tests are directed relate to the relative severity of the declines 
occurring among different classes of goods. 

The 670 price relatives (obtained from price quotations 
compiled by the U. S. Bureau of Labor Statistics) may be 
classified into those relating to perishable goods (505 in 
number) and those relating to durable goods (165 in number). 
The classification has economic significance because of differ- 
ences in the market conditions, on both supply and demand 
sides, affecting these classes of goods during a major reces- 
sion. Again, the 670 observations may be broken down into 
those relating to raw materials (134 in number) and those 
relating to manufactured goods (536 in number). Applying 
the two principles of classification jointly we obtain 4 sub- 
groups, perishable raw materials (101 in number), perish- 
able manufactured goods (404 in number), durable raw 
materials (33 in number) and durable manufactured goods 
(132 in number). It is to be noted that the ratio of the num- 
ber of perishable raw materials to the number of perishable 

68 1 
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manufactured goods, 101:404, is the same as the ratio 
of the number of durable raw materials to the number of 
durable manufactured goods, 33 : 132. It is a necessary con- 
dition of the procedure here discussed that the frequencies 
in the several sub-groups be proportional. 

Various questions relating to the significance of these prin- 
ciples of classification may be answered with reference to the 
summary figures given in Table I. 


Table I 


Measurements Relating to the Analysis of the Relative Prices of 
670 Commodities for February, 1933 
(1926 = 100) 


1 

Perishable 
raw materials 

Ni = 101 
Ah = 41.663366 
2d 2 = 31,118.56 

3 

Durable 

raw materials 

N 3 = 33 
M, = 65.060606 
2d 2 = 12,217.88 

A 

All 

raw materials 
IV = I'M. 
m', = 47.425373 
2d* = 56,952.76 


2 

Perishable 
manufactured goods 

Ni = 404 
Mi = 62.329208 
2d 2 = 187,414.21 

4 

Durable 

manufactured goods 

Nt = 132 
Mi = 75.719697 
2d 2 = 31,308.63 

B 

All 

manufactured goods 

N m = 536 
M m = 65.626866 
2d 2 = 236,562.35 


I 

All 

perishable goods 

N p = 505 

= 58. 196040 
2d 2 = 253,040.57 

II 

All 

durable goods 

N d = 165 
M d = 73.587879 
2d 2 = 46,525.97 


All 

commodities 

N = 670 
M = 61.986567 
2d 2 = 329,029.89 


The entries relating to each group and sub-group define 
the number of commodities included, the mean value of the 
price relatives for February, 1933, and the sum of the 
squares of the deviations of the observations in that group 
from the mean of that group. Thus for perishable raw ma- 



PRICE RELATIVES 


683 


terials the mean is 41 . 663366 (indicating an average price 
decline of 58 . 34 per cent) and the sum of the squares of the 
deviations of the 101 observations in this group from 
41.663366 is 31,118.56. For all commodities the mean is 
61.986567, and the sum of the squares of the deviations of 
the individual items from this mean is 329,029 . 89. 

A TEST OF THE PERISHABLE-DURABLE PRINCIPLE OF 
CLASSIFICATION 

We may first ask whether the application of the two 
basic principles of classification, considered separately, gives 
groups showing significant differences in their price changes 
from 1926 to February, 1933. Examining the results of the 
perishable-durable distinction, we note that durable goods, 
with an average of 73.587879, show smaller price declines 
than perishable goods, for which the average is 58 . 196040. 
(Six decimal places are retained in the averages because these 
figures enter into later calculations.) Is the difference sig- 
nificant, or may it be attributed to chance? A test of the 
type discussed in Chapter XV provides an answer to this 
question. For the application of the test we must divide the 
total variability, 329,029.89, into a portion unaffected by 
perishable-durable differences and a portion that may be 
attributed to the play of forces directly related to this dis- 
tinction. 

The first of these portions, measuring the variability 
within classes, is derived directly from the figures in Table I. 

Variability within perishable group 

= 2 d 2 for that group = 253,040 . 57 

Variability within durable group 

= 2d 2 for that group = 46,525 . 97 
Total variability within classes 299,566 . 54 

In deriving a measure of the variability between classes 
we take the deviation of each class mean from the mean of 
all the observations, square this, and weight by the number 
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of observations in that class. Thus 
2eP between perishable-durable classes 

= [(61 . 986567 - 58 . 196040) 2 X 505] + [(61 . 986567 - 73 . 587879) 1 

X 165] = 29,463.31. 

A test of the significance of this classification reduces to 
the question whether the variability between classes is sig- 
nificantly greater than the variability within classes, when 
account has been taken of the number of degrees of freedom 
present in the two measures of variability. The appropriate 
2 -test is shown below. 


Nature of 
variability 

Degrees of 
freedom 
n 

Sum of 
squares 

Variance 

a 2 

Log, <r* 

Between classes 

1 

29,463.31 

29,463.31 

10.290900 

Within classes 

668 

299,566.54 

448.45 

6.105804 


669 

329,029.85 

Diff. 

= 4.185096 


2 = 2.09 

For ni = 1 and n 2 = 668 the 1 per cent value of z is approx- 
imately .95; the present value is materially greater than 
this. The variance between classes is significantly greater 
than the variance within classes. The results are not con- 
sistent with the hypothesis that the true value of z is zero. 
There is a significant difference between the February, 1933, 
price relatives of perishable and durable goods, on the 1926 
base. This principle of classification is a significant one, 
with reference to this aspect of price behavior. 

A TEST OF THE RAW-MANUFACTURED PRINCIPLE OF 
CLASSIFICATION 

The test of the other main principle of classification follows 
exactly the same lines. The total variability, 329,029.89, 
is broken into a portion measuring variability within classes 
(293,515.11), with 668 degrees of freedom, and a portion 
measuring variability between the raw-manufactured classes 
(35,514.75) with 1 degree of freedom. The value of z is 
2.20; the corresponding 1 per cent value of z is .95. This 
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principle of classification, also, is significant. Raw and 
manufactured goods differed significantly in degree of price 
change between 1926 and February, 1933. 

A TEST OF THE RESULTS OBTAINED FROM THE JOINT APPLICA- 
TION OF THE PERISHABLE-DURABLE AND RAW-MANU- 
FACTURED PRINCIPLES OF CLASSIFICATION 

The application of the two principles of classification dis- 
cussed above yields the 4 cells shown in Table I. We may 
ask whether the four groups thus distinguished — perishable 
raw materials, perishable manufactured goods, durable raw 
materials, and durable manufactured goods — are signifi- 
cantly different, judged with reference to the present obser- 
vations. The two essential elements of the total variability 
are derived from the figures in Table I in the manner indi- 
cated below. 

Variability within perishable raw materials group = 31,118.56 
Variability within perishable manufactures group = 187,414.21 
Variability within durable raw materials group = 12,217.88 
Variability within durable manufactures group = 31,308.63 
Total variability within cells 262,059 . 28 

This sum furnishes the yardstick that is used in the tests 
that follow. It is clear that it represents the action of forces 
other than those related to relative durability, or to degree 
of fabrication. For its four elements measure variability 
among commodities that are alike in respect of durability 
and alike in respect of degree of fabrication. 1 This sum is a 
measure of the strength of the forces we lump together 
as chance, which here means all factors affecting our observa- 
tions other than those related to the relative durability of 
commodities or to degree of fabrication of commodities. 

1 This statement may be accepted as accurate for the purpose of the present 
demonstration. Actually, of course, the distinctions between perishable and 
durable commodities and between raw and manufactured goods are not clear- 
cut and definitive. 
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A measure of variability between cells is derived as in the 
previous examples. 

2d* between cells = [(61.986567 - 41. 663366)* X 101] 

+ [(61.986567 - 62.329208)* X 404] 
+ [(61.986567 - 65.060606)* X 33] 
+ [(61.986567 - 75.719697)* X 132] 
= 66,970.60. 

The test of significance takes the following form. 


Nature of 
variability 

Degrees of 
freedom 
n 

Sum of 
squares 

Variance 

a 1 

Log, cr * 

Between cells 

3 

66,970.60 

22,323.53 

10.013395 

Within cells 

666 

262,059.28 

393.48 

5.975035 


669 

329,029.88 

Diff. 

= 4.038360 


2 = 2.02 

For tii = 3, n 2 = 666 the 1 per cent value of z is approxi- 
mately .67. The present value materially exceeds this. 
The conclusion is clear that the joint application of the two 
principles of classification yields sub-groups which differed 
significantly in their price movements between 1926 and 
February, 1933. 

FURTHER TESTS OF THE MAIN PRINCIPLES OF CLASSIFICATION 
The test applied in the preceding section does not bring 
out the most significant uses of a two-fold principle of classi- 
fication. In treating the four cells as we have, we have not 
made full use of the information we possess about them. The 
variance between cells, measured by the sum 66,970 . 60, with 
3 degrees of freedom, represents the combined influence of 
forces related to the perishable-durable principle of classifi- 
cation, to the raw-manufactured principle, and to the inter- 
action among forces related to these two principles. We 
may apply more refined tests, and obtain more accurate 
information about the differential price behavior of com- 
modities of different types, by distinguishing the components 
of the variance between cells. This is done in Table J, 
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which presents a complete breakdown of the total variance 
of the observations with which we are working. 


Table J 


Components of Variance among Observations Relating to Commodity 


Price Movements, 1926 — 

February, 1933 


Nature of 

(1926 - 100) 
Degrees of 

Sum of 

Variance 

variability 

freedom 

squares 

<r 2 

Between perishable-durable 

classes 

i 

29,463.31 

29,463.31 

Between raw-manufactured 

classes 

1 

35,514.75 

35,514.75 

Interaction (residual varia- 

bility between cells) 

1 

1,992.54 

1,992.54 

Within cells (“ experimental 

error”) 

666 

262,059.28 

393.48 


669 

329,029.88 



Having these components we may test with greater ac- 
curacy than on pages 683 and 684 the significance of the two 
main principles of classification. For we now have a better 
yardstick, a better measure of the magnitude of variations 
due to the play of “chance.” The variability within cells 
(variance = 393.48) is a better criterion of the magni- 
tude of sampling errors than is the variability within the 
perishable and durable classes (variance = 448.45) or the 
variability within the raw and manufactured classes (vari- 
ance = 439.39). For the variance within the four cells is 
free of the influence of forces connected with either of the 
specified principles of classification. 

This more accurate test of the perishable-durable prin- 
ciple of classification is applied by the customary method. 


Nature of Degrees of 

variability freedom 

Between perishable-durable 
classes 1 

Within cells 666 


Variance 

<r l 


Log,o •* 


29,463.31 10.290900 

393.48 5.975035 


Diff. - 4.315865 
z = 2.16 
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The 1 per cent value of z is approximately .95, and the 
above result is clearly significant. The application of the 
perishable-durable principle of classification, under the con- 
ditions represented in Table I, yields classes of commodities 
that differed significantly in their price changes between 
1926 and February, 1933. It is important to note that raw 
and manufactured goods are present in the perishable and 
durable groups in precisely the same proportions. One fifth 
of the commodities in each group are raw materials and 
four fifths are manufactured goods. Thus behavior peculiar 
to raw materials may be expected to influence the two 
groups in precisely the same degree; the same is true of 
behavior peculiar to goods in the manufactured state. 1 It is 
necessary, for this reason, that the frequencies in the several 
classes be proportional in the application of the tests here 
discussed, when two principles of classification are jointly 
employed. 2 

A test of the significance of the raw-manufactured prin- 
ciple of classification may be applied in the same way. 
The variance within cells is employed as yardstick, as in the 
preceding example. Here, also, proportionality is necessary, 
with raw and manufactured goods being divided in the same 
proportions into perishable and durable sub-groups. The 
test reveals a significant difference in price behavior between 
raw and manufactured commodities. 

A TEST OF THE INTERACTION 

Not all the variability between cells is explained by the 
two major classifications we have just discussed. The 
residual variability between cells, or the interaction, amounts 
to 1,992.54, in terms of squared deviations (see Table J). 

1 See below, however, for a test of the significance of the interaction. 

2 For a discussion of procedures appropriate to cases in which cell frequencies 
are not proportional see 

Yates, F. Journal of Agricultural Science , Vol. 23, 108 (1933). 

Snedocor, G. W. and Cox, G. M. Iowa Agricultural Experiment Station 
Bulletin 180 (1935). 
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This may be derived readily by subtracting from the total 
variability between cells (66,970. 60) the sum of the variabil- 
ity between perishable-durable classes (29,463.31) and the 
variability between raw-manufactured classes (35,514.75). 
The number of degrees of freedom in the interaction may be 
determined by the same process of subtraction. In the pres- 
ent instance it is 1. 

This residual variability may represent “experimental 
error, ” the play of the same chance forces that are measured 
by the variability within cells. The residual variability 
was used, in the last example cited in Chapter XV, as a yard- 
stick defining the magnitude of fluctuations due to chance. 
It is proper to assume that this is the case when the two 
major principles of classification are quite independent of 
one another. But if these principles are correlated, the re- 
sidual variability reflects the interaction of the two prin- 
ciples of classification — the differential behavior of given 
classes of goods under the influence of forces related to the 
other principle of classification. Thus it may be that the 
difference between raw perishable and manufactured perish- 
able goods is not the same as the difference between raw 
durable and manufactured durable goods. The process of 
fabrication applied to perishable goods may produce results 
(in the form of price behavior) different from those produced 
when the process of fabrication is applied to durable goods. 
Perishable and durable goods may respond differently, as 
regards their price behavior, to the influence of fabrication. 
Such differential behavior of categories of goods under the 
influence of the same treatment (i.e., fabrication) is meas- 
ured by the interaction. 

If there is no such differential behavior, in a given ex- 
periment, the residual variability between cells will be of 
the same order of magnitude as the variability within cells, 
when account is taken of number of degrees of freedom. A 
test is applied on page 690. 

If we judge this result with reference to the 1 per cent 
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Nature of Degrees of Sum of Variance 

variability freedom squares <r 2 "* 

Interaction (residual vari- 
ability between cells) 1 1,922.54 1,922.54 7.561429 

Within cells 666 262,059.28 393.48 5.975035 

Diff. = 1.586394 
*= .79 

value of z (.95), we would conclude that the residual vari- 
ability between cells is attributable to the play of chance 
rather than to any true interaction. For although the resid- 
ual variability is greater than the variance within cells 
which we use as yardstick, the excess is not clearly too great 
to be attributed to chance. Reference to the 5 per cent 
value of z (.675, for n\ = 1, n 2 = 666) throws more light 
on the situation. Less frequently than 5 times out of 100 
would the play of chance alone give us a measure of resid- 
ual variability as great as that here obtained. For the z 
of .79 is greater than the 5 per cent value, .675. In such a 
case as this, where P falls between . 01 and . 05, the evidence 
is not conclusive. There is, however, a strong indication 
that perishable and durable goods respond differently, in 
their price behavior, to the process of fabrication. Reference 
to Table I will show that among both perishable and dur- 
able goods fabrication appears to have reduced susceptibility 
to price decline under the force of business recession. M 2 
is distinctly greater than Mi, and M 4 is greater than M%. 
But the influence of fabrication was apparently greater 
among perishable than among durable goods. 1 Our test 
shows that the degree of difference between the two reduc- 
tions (i.e., reductions in degree of price decline) is almost 
too great to be attributed to chance. The evidence of differ- 
ential behavior is strong enough to justify further investi- 
gation. 

1 The statistical evidence does not, of course, yield information as to the 
nature of the causal relations involved. The test here applied, if positive, 
reveals the presence of interaction, but does not show how the forces involved 
interact to bring about the observed differential behavior. The text is to be 
read with this qualification in mind. 
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GLOSSARY OF SYMBOLS 

The following are the more important symbols employed 
in the preceding pages. Those of which limited use is made, 
for special purposes, are not here included. A given symbol 
is sometimes called upon to serve different purposes, but the 
precise meaning should be clear from the context. 

1. General symbols for variables and constants: 

x: a variable quantity. 

y: a variable quantity. 

In general, any letter near the end of the alphabet may 
be employed to represent a variable quantity. Different 
variable quantities may be represented by the use of a 
single symbol, with different subscripts, as X h X 2 , X z , 
or W i, W 2 , W 3 . [A distinction is later drawn (cf. Sym- 
bols employed in the measurement of relationship) 
between capital letters and small letters, as used to 
represent variable quantities.] 

a: a constant (i.e., a quantity the value of which does not 
change in the given discussion). In general, any letter 
near the beginning of the alphabet may be used to 
represent a constant. 

2. Symbols employed in the analysis and description of 

the frequency distribution: 

m: the value of an individual observation; the value of the 
mid-point of a class. (The symbols ai, a 2 , a 3 are some- 
times employed to represent different observations in 
a series.) 

f: the number of observations in a given class; the frequency 
of a given class. 

i: the class-interval. 

I : the lower limit of a class. 

N: the total number of cases in a given series or frequency 
distribution. 
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d: the deviation of a given observation from an average; 
usually, a deviation from the arithmetic mean. When 
written with a subscript, as d x or d y , it refers to a devia- 
tion from the arithmetic mean of the variable repre- 
sented by the subscript. The symbol d is sometimes 
used to designate the difference between mean and 
mode. 

d f : the deviation of a given observation from an arbitrary 
origin, or assumed mean. 

c: the difference between an arbitrary origin, or assumed 
mean, and the true mean (in terms of the symbols ex- 
plained below, c = AT — M r ). 

2 (Sigma): the symbol for the process of summation. Thus 
2 d means the sum of all the deviations. 

101, W2 , t03* weights attached to a series of measures being 
averaged. (Not to be confused with similar symbols 
used to represent different variable quantities.) 

y 0 : the maximum ordinate of a frequency curve. 

Symbols for averages , quartiles , etc.: 

M: the arithmetic mean. 

Md.: the median. 

Mo.: the mode. 

M g : the geometric mean. 

H: the harmonic mean. 

M': the value of an assumed arithmetic mean. 

Qi: the first or lower quartile. 

Q 2 * the second quartile or median. 

Q3‘ the third or upper quartile. 

K: the value of a point midway between the first and third 
quartiles. 

Z) 3 * the third decile. 

Symbols for measures of variation and skewness: 

M.D.: the mean deviation. 

a: the standard deviation; the root-mean-square deviation 
about the arithmetic mean. 

a': the standard deviation of proportions, or relative fre- 
quencies. 

s a : the root-mean-square deviation about an origin other 
than the arithmetic mean. 

P.E.: the probable error. 
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Q.D.: the quartile deviation. 

qi: the difference between the median and the lower quartile 
(Md. - < 20 . 

^2* the difference between the upper quartile and the median 

(Qi - Md.). 


V: the coefficient of variation. 
sk: a measure of skewness. 

X (Chi): a measure of skewness based upon the criteria j8i 
and 182. 


Symbols for moments and criteria of curve type . 

Vi, V2, v 3) etc.: moments of a frequency distribution about an 
arbitrary origin. 

tti, 7t 2 , 7 r 3 , etc.: uncorrected moments of a frequency distribu- 
tion about the arithmetic mean. 

Pi, M2, M3, etc. : moments of a frequency distribution about the 
arithmetic mean after the application of Sheppard's 
corrections. 


ft: 


P± 

M2 2 


k 2 : A criterion of curve type based on / 3 i and / 3 2 . 


3 . Symbols relating to index numbers. 

p 0 ': price of a given commodity at time “0" (the base period). 
q 0 quantity of same commodity at time “0". 
pi': price of same commodity at time “1". 
qf: quantity of same commodity at time “1". 
p 0 ": price of a second commodity at time “0". 
qf': quantity of second commodity at time “0". 
pi"; price of second commodity at time “1". 

qi": quantity of second commodity at time “1". 

/ 

a price relative (relation of price of a given commodity 
p° 

at time “1 " to price of same commodity at time “0"). 

/ 

— f : a quantity relative, 

go 
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4. Symbols employed in the measurement of relationship. 
X: an observed value of a variable quantity. 

Y: an observed value of a variable quantity. (The observed 
values of different variables may be represented also by 
_ the symbols X h X 2j X 3 , or W h W 2 , TF 3 .) 

X: the arithmetic mean of a number of observed values of 
the variable X . A similar symbol may be employed for 
other variables. (In one demonstration in the preceding 
pages, relating to multiple correlation, the symbols A u 
A 2) . . . are used to represent the arithmetic means 

of the variables X\ 9 X 2y X 2 . . . . The symbols M x 
and M v are occasionally employed to designate the 
arithmetic means of different variables.) 
x: value of a variable quantity expressed as a deviation 
from the arithmetic mean of all the observed values. 
The symbol y and the symbols x h x 2) . . . are 

similarly employed with respect to variables repre- 
sented, as to original observations, by the symbols 

y, x u x 2j x 3 . . . . 

X': a value of a variable quantity expressed as a deviation, 
in class-interval units, from an arbitrary origin. The 
symbol Y' has a similar meaning. 

X": a value of a variable quantity expressed as a deviation, 
in original units, from an arbitrary origin. The sym- 
bol y" has a similar meaning. 

Y e : the computed or estimated value of a variable, as de- 
termined from an equation of average relationship; 
the symbol y c may be employed for such a computed 
value, expressed as a deviation from the mean. 
p: the mean product of two variables when expressed as 
deviations from their respective arithmetic means, i.e., 
2 (xif) 

p = When written with subscripts, as pu, the 

latter relate to the variables in question, as X\, Xt. 
p': the mean product of two variables when expressed as 
deviations from assumed arithmetic means, 
r: the Pearsonian coefficient of correlation. When written 
with subscripts, the latter indicate the variables to 
which the coefficient relates. Thus r yx refers to the 
variables y and x, and r l2 refers to the variables Xi 
and Xt. 

p (rho): a general index of correlation. Subscripts should be 
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employed to indicate the variables to which the meas- 
ure relates, as p VX) p*i/, pi og yXf piog y log Pi > etc. 

v 9 

(In each case the first subscript relates to the depend- 
ent variable.) 

p: a corrected index of correlation. 

d: the deviation of a given observation from a fitted curve; 
the difference between an observed and a corresponding 
computed value of a variable. 

v: a residual; identical in meaning with d 9 as given above. 

S: the root-mean-square deviation about a fitted curve; 
the standard error of estimate. This measure should 
be written with a subscript to indicate the variable to 
which it applies, as S y , S x , Siog y (the standard error of 
estimate in terms of logarithms), S r (the standard error 
of estimate in terms of ratios), Si (the standard error 

V 

_ of estimate in terms of reciprocals). 

S: a corrected standard error of estimate. 

p r : the coefficient of rank correlation. 
ri (eta): the correlation ratio. Subscripts should be employed 
to represent the variables to which the measure re- 
lates, as 7] yx or rj xy . The first subscript in each case 
relates to the dependent variable. 

rj: a corrected correlation ratio. 
a ay : the root-mean-square deviation about a line through the 
means of the columns of a correlation table; the stand- 
ard deviation of the y -& rrays about their respective 
means. The symbol a ax has the same meaning with 
respect to the rows of a correlation table, or the 
x-arrays. 

am,,: the standard deviation of the means of the columns of 
a correlation table about the mean of all the y 1 s, the 
mean of each column being weighted by the number of 
items in that column. The symbol a mx has the same 
meaning with respect to the means of the rows, 
f (zeta): the test for linearity of regression (f = V 2 — r 2 ). 

m: the number of arrays employed in the computation of a 
given correlation ratio; also, the number of constants 
in the equation defining a curvilinear or multiple 
regression. 

b: the coefficient of regression; the slope of a line of regres- 
sion. When written with subscripts, the latter relate 
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to the variables in question, as b vx , bu (for the variables 
Xi, x 2 ). The first subscript relates to the dependent 
variable in each case; b yx is the coefficient of regression 
of y on x and bxy is the coefficient of regression of x on y . 
z: a logarithmic transformation of the coefficient of corre- 
lation. z = i{loge (1 + r) - loge (1 - r)}. 

R 1.234: the coefficient of multiple correlation between a de- 
pendent variable, x\ 9 and a combination of independent 
variables, x 2 , x 3 , and x 4 . The order may be changed, 
but the primary subscript always relates to the de- 

_ pendent variable. 

fii.234* a corrected coefficient of multiple correlation. 

7*12.34* the coefficient of partial or net correlation between the 
variables x\ and x 2 , when the variables x 3 and x 4 are 
held constant. The order of subscripts is changed for a 
different combination of variables, the two primary 
subscripts always relating to the variables between 
which the net correlation is being measured. 

612.34* the coefficient of net regression between the variables 
X\ and x 2 , the former being dependent, when the vari- 
ables Xz and x 4 are also taken account of in the estimat- 
ing equation; the weight given to x 2 in estimating Xi, 
when the estimate is also based upon values of x 3 and 
x 4 . The order of subscripts is changed for a different 
combination of variables. 

iSi.234* the root-mean-square deviation about a line describing 
the relationship between a dependent variable, X\, and 
a series of independent variables, x 2) x 3 , and x 4 ; the 
standard error of estimate of xi under these conditions. 

a 1.234* the standard deviation of the fourth order; identical 
with Si. 234* 

012.34* a coefficient of partial regression in an equation relating 
to variables expressed in standard deviation units. 

(In the seven measures immediately above, the number of subscripts corre- 
sponds to the number of variables included in a given study. For the sake of sim- 
plicity, only four variables have been assumed.) 

5. Symbols employed in the measurement of errors. 

<r: the standard deviation of a parent population. 

<*m or a*: the standard error of a mean, derived from a knowledge 
of the <r of the population. 

5: the standard deviation of a sample. 



GLOSSARY OF SYMBOLS 


697 


Sm or s%: the estimated standard error of a mean, in the deriva- 
tion of which $ is used as an approximation to <r. 

T: the deviation of a given statistical measurement from 
the mean of a normal distribution, expressed in units 
of the standard deviation of that distribution; a normal 
deviate. 

t: the deviation of a given statistical measurement from a 
hypothetical value, expressed in units of the estimated 
standard error of the measurement in question. 

cm*: the standard error of the mean of a stratified sample. 

D: a difference between two means. 

< td : the standard error of the difference between two means. 

D p : a difference between two percentages. 

<r Dp : The standard error of the difference between two per- 
centages. 

D z : the difference between two logarithmic transformations 
of the coefficient of correlation. 

a d z : the standard error of D*. 

<r, with any subscript, is used to represent the standard 
error of the measure to which the subscript relates. 

P.E. with any subscript is used to represent the prob- 
able error of the measure to which the subscript re- 
lates (P.E. = . 67449cr). 

(Tbi-bi-' the standard error of the difference between two coeffi- 
cients of regression. 

6. Symbols employed in the analysis of variance. 

z: the difference between the natural logarithms of two 
standard deviations. 

<r z : the standard error of z. 

m: the number of degrees of freedom in the larger of two 
variances being compared. 

n 2 : the number of degrees of freedom in the smaller of two 
variances being compared. 

7. Other symbols. 

p: the probability of a successful outcome of a given 
event. 

q: the probability of an unsuccessful outcome of a given 
event. 

n: the number of independent events in a given trial. 
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X®* a quantity used in testing hypotheses involving the 
computation of theoretical frequencies; x® defines the 
relative magnitude of the differences between observed 
and theoretical frequencies. 


Greek Alphabet 


Letters 

Names 

Letters 

Names 

Letters 

Names 

A a 

Alpha 

I t 

Iota 

PP 

Rho 


Beta 

K K 

Kappa 

2 cr 

Sigma 

r 7 

Gamma 

A X 

Lambda 

T T 

Tau 

A 8 

Delta 

M/i 

Mu 

T v 

Upsilon 

£ € 

Epsilon 

N */ 

Nu 

<I> 

Phi 

Z £ 

Zeta' 

S £ 

Xi 

Xx 

Chi 

H v 

Eta 

O o 

Omicron 


Psi 

0 e 

Theta 

II 7 r 

Pi 

O ft) 

Omega 
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Areas of the Normal Curve of Error in Terms of Abscissa 


HI b 

.00 

.01 

.02 

.03 

.04 

.05 

.06 

.07 

.08 

.09 

0.0 

.00000 

.00399 

.00798 

.01197 

.01595 

.01994 

.02392 

.02790 

.03188 

.03586 

0.1 

.03983 

.04380 

.04770 

.05172 

.05567 

.05962 

.06356 

.06749 

.07142 

.07535 

0.2 

.07920 

.08317 

.08706 

.09095 

.09483 

.09871 

. 10257 

.10642 

.11026 

.11409 

0.3 

.11791 

. 12172 

. 12552 

. 12930 

. 13307 

.13683 

. 14058 

. 14431 

.14803 

. 15173 

0.4 

.15542 

.15910 

. 16276 

.16640 

.17003 

.17364 

.17724 

.18082 

.18439 

.18793 

0.5 

. 19146 

.19497 

.19847 

.20194 

.20540 

.20884 

.21226 

.21566 

.21904 

.22240 

0.6 

.22575 

.22907 

.23237 

.23565 

.23891 

.24215 

.24537 

.24857 

.25175 

.25490 

0.7 

.25804 

.26115 

.26424 

.26730 

.27035 

.27337 

.27037 

.27935 

.28230 

.28524 

0.8 

.28814 

.29103 

.29389 

.29673 

.29955 

.30234 

.30511 

.30785 

.31057 

.31327 

0.9 

.31594 

.31859 

.32121 

.32381 

.32039 

.32894 

.33147 

.33398 

.33646 

.33891 

1.0 

.34134 

.34375 

.34614 

.34850 

.35083 

.35314 

.35543 

.35769 

.35993 

.36214 

1.1 

.30433 

.36650 

.36864 

.37070 

.37286 

.37493 

.37098 

.37900 

.38100 

.38298 

1.2 

.38493 

.38686 

.38877 

.39065 

.39251 

.39435 

.39017 

.39790 

.39973 

.40147 

1.3 

.40320 

.40490 

.40658 

.40824 

.40988 

.41149 

.41309 

.41466 

.41621 

.41774 

1.4 

.41924 

.42073 

.42220 

.42364 

.42507 

.42647 

.42780 

.42922 

.43056 

.43189 

1.5 

.43319 

.43448 

.43574 

.43699 

.43822 

.43943 

.44062 

.44179 

.44295 

.44408 

1.0 

.44520 

.44630 

.44738 

.44845 

.44950 

.45053 

.45154 

.45254 

.45352 

.45449 

1.7 

.45543 

.45637 

.45728 

.45818 

.45907 

.45994 

.40080 

.40104 

.46246 

.40327 

1.8 

.46407 

.46485 

.46562 

.46638 

.46712 

.40784 

.46856 

.46926 

.46995 

.47062 

1.9 

.47128 

.47193 

.47257 

.47320 

.47381 

.47441 

.47500 

.47558 

.47615 

.47070 

2.0 

.47725 

.47778 

.47831 

.47882 

.47932 

.47982 

.48030 

.48077 

.48124 

.48169 

2.1 

.48214 

.48257 

.48300 

.48341 

.48382 

.48422 

.48461 

.48500 

.48537 

.48574 

2.2 

.48610 

.48645 

.48679 

.48713 

.48745 

.48778 

.48809 

.48840 

.48870 

.48899 

2.3 

.48928 

.48950 

.48983 

.49010 

.49036 

.49061 

.49080 

.49111 

.49134 

.49158 

2.4 

.49180 

.49202 

.49224 

.49245 

.49206 

.49286 

.49305 

.49324 

.49343 

.49301 

2.5 

.49379 

.49390 

.49413 

.49430 

.49446 

.49461 

.49477 

.49492 

.49506 

.49520 

2.0 

.49534 

.49547 

.49560 

.49573 

.49585 

.49598 

.49609 

.49621 

.49032 

.49643 

2.7 

.49053 

.49664 

.49674 

.49083 

.49093 

.49702 

.49711 

.49720 

.49728 

.49730 

2.8 

.49744 

.49752 

.49760 

.49707 

.49774 

.49781 

.49788 

.49795 

.49801 

.49807 

2.9 

.49813 

.49819 

.49825 

.49831 

.49836 

.49841 

.49846 

.49851 

.49856 

.49861 

3.0 

.49865 

.49869 

.49874 

.49878 

.49882 

.49886 

.49889 

.49893 

.49897 

.49900 

3.1 

.49903 

.49906 

.49910 

.49913 

.49916 

.49918 

.49921 

.49924 

.49926 

.49929 

3.2 

.49931 

.49934 

.49936 

.49938 

.49940 

.49942 

.49944 

.49946 

.49948 

.49950 

3.3 

.49952 

.49953 

.49955 

.49957 

.49958 

.49960 

.49961 

.49902 

.49964 

.49965 

3.4 

.49906 

.49968 

.49969 

.49970 

.49971 

.49972 

.49973 

.49974 

.49975 

.49976 


689 
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Table of t 


n 

P = .05 

.02 

.01 

1 

12.706 

31.821 

63.657 

2 

4.303 

6.965 

9.925 

3 

3.182 

4.541 

5.841 

4 

2.776 

3.747 

4.6Q4 

5 

2.571 

3.365 

4.032 

6 

2.447 

3.143 

3.707 

7 

2.365 

2.998 

3.499 

8 

2.306 

2.896 

3.355 

9 

2.262 

2.821 

3.250 

10 

2.228 

2.764 

3.169 

11 

2.201 

2.718 

3.106 

12 

2.179 

2.681 

3.055 

13 

2.160 

2.650 

3.012 

14 

2.145 

2.624 

2.977 

15 

2.131 

2.602 

2.947 

16 

2.120 

2.583 

2.921 

17 

2.110 

2.567 

2.898 

18 

2.101 

2.552 

2.878 

19 

2.093 

2.539 

2.861 

20 

2.086 

2.528 

2.845 

21 

2.080 

2.518 

2.831 

22 

2.074 

2.508 

2.819 

23 

2.069 

2.500 

2.807 

24 

2.064 

2.492 

2.797 

25 

2.060 

2.485 

2.787 

26 

2.056 

2.479 

2.779 

27 

2.052 

2.473 

2.771 

28 

2.048 

2.467 

2.763 

29 

2.045 

2.462 

2.756 

30 

2.042 

2.457 

2.750 

00 

1.95996 

2.32634 

2.57582 


1 Excerpts from Table IV, R. A. Fisher, Statistical Methods for Research 
Workers . These excerpts are printed here through the courtesy of Dr. Fisher 
and his publishers, Oliver and Boyd, of Edinburgh. 
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Appendix Table III 1 


Values of the Correlation Coefficient for Different Levels of 
Significance 


n 

P = .05 

.02 

.01 

1 

.996917 

.9995066 

. 9998766 

2 

.95000 

.98000 

.990000 

3 

.8783 

.93433 

.95873 

4 

.8114 

.8822 

.91720 

5 

.7545 

.8329 

.8745 

6 

.7067 

.7887 

.8343 

7 

.6664 

.7498 

.7977 

8 

.6319 

.7155 

.7646 

9 

.6021 

.6851 

.7348 

10 

.5760 

.6581 

.7079 

11 

.5529 

.6339 

.6835 

12 

.5324 

.6120 

.6614 

13 

.5139 

.5923 

.6411 

14 

.4973 

.5742 

.6226 

15 

.4821 

.5577 

.6055 

16 

.4683 

.5425 

.5897 

17 

.4555 

.5285 

.5751 

18 

.4438 

.5155 

.5614 

19 

.4329 

.5034 

.5487 

20 

.4227 

.4921 

.5368 

25 

.3809 

.4451 

.4869 

30 

.3494 

.4093 

.4487 

35 

.3246 

.3810 

.4182 

40 

.3044 

.3578 

.3932 

45 

.2875 

.3384 

.3721 

50 

.2732 

.3218 

.3541 

60 

.2500 

.2948 

.3248 

70 

.2319 

.2737 

.3017 

80 

.2172 

.2565 

.2830 

90 

.2050 

.2422 

.2673 

100 

.1946 

.2301 

.2540 


For a total correlation, n is 2 less than the number of pairs in the 
sample; for a partial correlation, the number of eliminated variates also 
should be subtracted. 


1 Excerpts from Table V-A, R. A. Fisher, Statistical Methods for Research 
Workers. These excerpts are printed here through the courtesy of Dr. Fisher 
and his publishers, Oliver and Boyd, of Edinburgh. 
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Appendix Table IV 


Showing the Relations between r and e for Values of s from 0 to 5 l 


z 

.00 

.01 

.02 

.03 

.04 

.05 

.06 

.07 

.08 

.09 

.0 

.0000 

.0100 

.0200 

.0300 

.0400 

.0500 

.0599 

.0699 

.0798 

.0898 

.1 

.0997 

.1096 

.1194 

.1293 

.1391 

.1489 

.1587 

.1684 

.1781 

.1878 

.2 

.1974 

.2070 

.2165 

.2260 

.2355 

.2449 

.2543 

.2636 

.2729 

.2821 

.3 

.2913 

.3004 

.3095 

.3185 

.3275 

.3364 

.3452 

.3540 

.3627 

.3714 

.4 

.3800 

.3885 

.3969 

.4053 

.4136 

.4219 

.4301 

.4382 

.4462 

.4542 

.5 

.4621 

.4700 

.4777 

.4854 

.4930 

.5005 

.5080 

.5154 

.5227 

.5299 

.6 

.5370 

.5441 

.5511 

.5581 

.5649 

.5717 

.5784 

.5850 

.5915 

.5980 

.7 

.6044 

.6107 

.6169 

.6231 

.6291 

.6352 

.6411 

.6469 

.6527 

.6584 

.8 

.6640 

.6696 

.6751 

.6805 

.6858 

.6911 

.6963 

.7014 

.7064 

.7114 

.9 

.7163 

.7211 

.7259 

.7306 

.7352 

.7398 

.7443 

.7487 

.7531 

.7574 

1.0 

.7616 

.7658 

.7699 

.7739 

.7779 

.7818 

.7857 

.7895 

.7932 

.7969 

1.1 

.8005 

.8041 

.8076 

.8110 

.8144 

.8178 

.8210 

.8243 

.8275 

.8306 

1.2 

.8337 

.8367 

.8397 

.8426 

.8455 

.8483 

.8511 

.8538 

.8565 

.8591 

1.3 

.8617 

.8643 

.8668 

.8693 

.8717 

.8741 

.8764 

.8787 

.8810 

.8832 

1.4 

.8854 

.8875 

.8896 

.8917 

.8937 

.8957 

.8977 

.8996 

.9015 

.9033 

1.5 

.9052 

.9069 

.9087 

.9104 

.9121 

.9138 

.9154 

.9170 

.9186 

.9202 

1.6 

.9217 

.9232 

.9246 

.9261 

.9275 

.9289 

.9302 

.9316 

.9329 

.9342 

1.7 

.9354 

.9367 

.9379 

.9391 

.9402 

.9414 

.9425 

.9436 

.9447 

.9458 

1.8 

.9468 

.9478 

.9498 

.9488 

.9508 

.9518 

.9527 

.9536 

.9545 

.9554 

1.9 

.9562 

.9571 

.9579 

.9587 

.9595 

.9603 

.9611 

.9619 

.9626 

.9633 

2.0 

.9640 

.9647 

.9654 

.9661 

.9668 

.9674 

.9680 

.9687 

.9693 

.9699 

2.1 

.9705 

.9710 

.9716 

.9722 

.9727 

.9732 

.9738 

.9743 

.9748 

.9753 

2.2 

.9757 

.9762 

.9767 

.9771 

.9776 

.9780 

.9785 

.9789 

.9793 

.9797 

2.3 

.9801 

.9805 

.9809 

.9812 

.9816 

.9820 

.9823 

.9827 

.9830 

.9834 

2.4 

.9837 

.9840 

.9843 

.9846 

.9849 

.9852 

.9855 

.9858 

.9861 

.9863 

2.5 

.9866 

.9869 

.9871 

.9874 

.9876 

.9879 

.9881 

.9884 

.9886 

.9888 

2.6 

.9890 

.9892 

.9895 

.9897 

.9899 

.9901 

.9903 

.9905 

.9906 

.9908 

2.7 

.9910 

.9912 

.9914 

.9915 

.9917 

.9919 

.9920 

.9922 

.9923 

.9925 

2.8 

.9926 

.9928 

.9929 

.9931 

.9932 

.9933 

.9935 

.9936 

.9937 

.9938 

2.9 

.9940 

.9941 

.9942 

.9943 

.9944 

.9945 

.9946 

.9947 

.9949 

.9950 


3.0 .9951 

4.0 .9993 

6.0 .9999 

1 The figures in the body of the table are values of r corresponding to 
2 -values read from the scales on the left and top of the table. 
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Appendix Table V 1 


Table of x 2 


n 

P = .99 

1 

.000157 

2 

.0201 

3 

.115 

4 

.297 

5 

.554 

6 

.872 

7 

1.239 

8 

1.646 

9 

2.088 

10 

2.558 

11 

3.053 

12 

3.571 

13 

4.107 

14 

4.660 

15 

5.229 

16 

5.812 

17 

6.408 

18 

7.015 

19 

7.633 

20 

8.260 


.95 .50 


.00393 

.455 

.103 

1.386 

.352 

2.366 

.711 

3.357 

1.145 

4.351 

1.635 

5.348 

2.167 

6.346 

2.733 

7.344 

3.325 

8.343 

3.940 

9.342 

4.575 

10.341 

5.226 

11.340 

5.892 

12.340 

6.571 

13.339 

7.261 

14.339 

7.962 

15.338 

8.672 

16.338 

9.390 

17.338 

10.117 

18.338 

10.851 

19.337 


.10 

.05 

2.706 

3.841 

4.605 

5.991 

6.251 

7.815 

7.779 

9.488 

9.236 

11.070 

10.645 

12.592 

12.017 

14.067 

13.362 

15.507 

14.684 

16.919 

15.987 

18.307 

17.275 

19.675 

18.549 

21.026 

19.812 

22.362 

21.064 

23.685 

22.307 

24.996 

23.542 

26.296 

24.769 

27.587 

25.989 

28.869 

27.204 

30.144 

28.412 

31.410 


.02 

.01 

5.412 

6.635 

7.824 

9.210 

9.837 

11.341 

11.668 

13.277 

13.388 

15.086 

15.033 

16.812 

16.622 

18.475 

18.168 

20.090 

19.679 

21.666 

21.161 

23.209 

22.618 

24.725 

24.054 

26.217 

25.472 

27.688 

26.873 

29.141 

28.259 

30.578 

29.633 

32.000 

30.995 

33.409 

32.346 

34.805 

33.687 

36.191 

35.020 

37.566 

36.343 

38.932 

37.659 

40.289 

38.968 

41.638 

40.270 

42.980 

41.566 

44.314 

42.856 

45.642 

44.140 

46.963 

45.419 

48.278 

46.693 

49.588 

47.962 

50.892 

— 1 may 

be used 


21 

8.897 

11.591 

20.337 

29.615 

32.671 

22 

9.542 

12.338 

21.337 

30.813 

33.924 

23 

10.196 

13.091 

22.337 

32.007 

35.172 

24 

10.856 

13.848 

23.337 

33.196 

36.415 

25 

11.524 

14.611 

24.337 

34.382 

37.652 

26 

12.198 

15.379 

25.336 

35.563 

38.885 

27 

12.879 

16.151 

26.336 

36.741 

40.113 

28 

13.565 

16.928 

27.336 

37.916 

41.337 

29 

14.256 

17.708 

28.336 

39.087 

42.557 

30 

14.953 

18.493 

29.336 

40.256 

43.773 


For larger values of n, the expression V2X 2 — y/2n 
as a normal deviate with unit standard error. 


1 Excerpts from Table III, R. A. Fisher, Statistical Methods for Research 

Workers. These excerpts are printed here through the courtesy of Dr. Fisher 
and his publishers, Oliver and Boyd, of Edinburgh. 
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Appendix Table VI 1 

1 Per Cent Points of the Distribution of z 



Values of m 

1. 

2. 

3. 

4. 

5. 

6. 

8. 

12. 

24. 

«o. 


1 

4.1535 

4.2585 

4 2974 

4.3175 

4.3297 

4.3379 

4.3482 

4.3585 

4.3689 

4.3794 


2 

2.2950 

2.2976 

2.2984 

2.2988 

2.2991 

2.2992 

2.2994 

2.2997 

2.2999 

2.3001 


3 

1 . 7649 

1.7140 

1.6915 

1.6786 

1.6703 

1.6645 

1.6569 

1.6489 

1.6404 

1 6314 


4 

1.5270 

1.4452 

1.4075 

1 . 3856 

1.3711 

1.3609 

1.3473 

1.3327 

1.3170 

1.3000 


5 

1 . 3943 

1 . 29)29 

1.2449 

1 2164 

1 1974 

1 . 1838 

1 . 1644 

1 . 1457 

1 . 1239 

1.0997 


6 

1.3103 

1.1955 

1.1401 

1 . 1068 

1.0843 

1.0680 

1.0460 

1.0218 

.9948 

.9643 


7 

1.2526 

1.1281 

1.0672 

1.0300 

1.0048 

.9864 

.9614 

.9335 

.9020 

.8658 


8 

1.2106 

1.0787 

1.0135 

.9734 

.9459 

.9259 

.8983 

.8673 

.8319 

.7904 


9 

1.1786 

1.0411 

.9724 

.9299 

.9006 

.8791 

.8494 

.8157 

.7769 

.7305 


10 

1.1535 

1.0114 

.9399 

.8954 

.8646 

.8419 

.8104 

.7744 

.7324 

.6816 


11 

1.1333 

.9874 

.9136 

.8674 

.8354 

.8116 

.7785 

.7405 

.6958 

.6408 


12 

1.1166 

.9677 

.8919 

.8443 

.8111 

.7864 

.7520 

.7122 

.6649 

.6061 


13 

1 . 1027 

.9511 

.8737 

.8248 

.7907 

.7652 

.7295 

.6882 

.6386 

.6761 


14 

1.0909 

.9370 

.8581 

.8082 

.7732 

.7471 

.7103 

.6675 

.6159 

.6500 


15 

1 0807 

.9249 

.8448 

.7939 

.7582 

.7314 

.6937 

.6496 

.5961 

.5269 


16 

1.0719 

.9144 

.8331 

.7814 

.7450 

.7177 

.6791 

.6339 

.5786 

.5064 


17 

1.0641 

.9051 

.8229 

.7705 

.7335 

.7057 

.6663 

.6199 

.5630 

.4879 

§ 

18 

1 0572 

.8970 

.8138 

.7607 

.7232 

.6950 

.6549 

.6075 

.5516 

.4712 

•3 

19 

1.0511 

.8897 

.8057 

.7521 

.7140 

.6854 

.6447 

.5964 

.5366 

.4560 

£ 

20 

1.0457 

.8831 

.7985 

.7443 

.7058 

.6768 

.6355 

.5864 

.5253 

.4421 


21 

1.0408 

.8772 

.7920 

.7372 

.6984 

.6690 

.6272 

.5773 

.5150 

.4294 


22 

1.0363 

.8719 

.7860 

.7309 

.6916 

.6620 

.6196 

.5691 

.5056 

.4176 


23 

1.0322 

.8670 

.7806 

.7251 

.6855 

.6555 

.6127 

.5615 

.4969 

.4068 


24 

1.0285 

.8626 

.7757 

.7197 

.6799 

.6496 

.6064 

.6545 

.4890 

.3967 


25 

1 0251 

.8585 

.7712 

.7148 

.6747 

.6442 

.6006 

.5481 

.4816 

.3872 


26 

1.0220 

.8548 

.7670 

.7103 

.6699 

.6392 

.5952 

.5422 

.4748 

.3784 


27 

1.0191 

l .8513 

.7631 

.7062 

.6655 

.6346 

.5902 

.6367 

.4685 

.3701 


28 

1.0164 

.8481 

! .7595 

.7023 

.6614 

.6303 

.5856 

.5316 

.4626 

.3624 


29 

1.0139 

.8451 

.7562 

.6987 

.6576 

.6263 

.5813 

.5269 

.4570 

.3550 


30 

1.0116 

.8423 

j .7531 

.6954 

.6540 

.6226 

.5773 

.5224 

.4519 

.3481 


60 

.9784 

.8025 

.7086 

.6472 

.6028 

.5687 

.5189 

.4574 

.3746 

.2352 


00 

.9462 

.7636 

.6651 

.5999 

.5522 

.5152 

4604 

.3908 

.2913 

0 


1 From Table VI, R. A. Fisher, Statistical Methods for Research Workers. 
This table is printed here through the courtesy of Dr. Fisher and his pub- 
lishers, Oliver and Boyd, of Edinburgh. 






Appendix Table VII 1 

5 Per Cent Points of the Distribution of z 



Values of n\ 


1. 

2. 

3. 

4. 

5. 

6. 

8. 

12. 

24. 

oo. 


1 

2.5421 

2.6479 

2 . 6870 

2.7071 

2.7194 

2.7276 

2.7380 

2 7484 

2.7588 

2.7693 


2 

1.4592 

1.4722 

1.4765 

1.4787 

1.4800 

1.4808 

1.4819 

1.4830 

1.4840 

1.4851 


3 

1.1577 

1.1284 

1.1137 

1 1051 

1.0994 

1.0953 

1.0899 

1.0842 

1.0781 

1 0716 


4 

1.0212 

.9690 

.9429 

.9272 

.9168 

.9093 

.8993 

.8885 

.8767 

.8639 


5 

.9441 

.8777 

.8441 

.8236 

.8097 

.7997 

.7862 

.7714 

.7550 

.7368 


6 

.8948 

.8188 

.7798 

.7558 

.7394 

.7274 

.7112 

.6931 

.6729 

6499 


7 

.8606 

.7777 

.7347 

.7080 

.6896 

.6761 

.6576 

.6369 

.6134 

.5862 


8 

.8355 

.7475 

.7014 

.6725 

.6525 

.6378 

.6175 

.5945 

.5682 

5371 


9 

.8163 

.7242 

.6757 

.6450 

.6238 

.6080 

. 5S62 

.5613 

.5324 

.4979 


10 

.8012 

.7058 

.6553 

.6232 

.6009 

.5843 

.5611 

.5346 

.5035 

.4657 


11 

.7889 

.6909 

.6387 

.6055 

.5822 

.5648 

.5406 

.5126 

.4795 

.4387 


12 

.7788 

.6786 

.6250 

.5907 

.5666 

.5487 

.5234 

.4941 

.4592 

.4156 


13 

.7703 

.6682 

.6134 

. 5783 

.5535 

.5350 

.5089 

.4785 

.4419 

.3957 


14 

.7630 

.6594 

.6036 

.5677 

.5423 

.5233 

.4964 

.4649 

.4269 

.3782 


15 

.7568 

.6518 

.5950 

.5585 

.5326 

.5131 

.4855 

.4532 

.4138 

.3628 


16 

.7514 

.6451 

. 5876 

.5505 

.5241 

.5042 

.4760 

.4428 

.4022 

.3490 


17 

.7466 

.6393 

.5811 

.5434 

.5166 

.4964 

.4676 

.4337 

.3919 

.3366 

§ 

18 

.7424 

.6341 

.5753 

.5371 

.5099 

.4894 

.4602 

.4255 

.3827 

.3253 


19 

.7386 

.6295 

.5701 

.5315 

.5040 

.4832 

.4535 

.4182 

.3743 

.3151 


20 

.7352 

.6254 

.5654 

.5265 

.4986 

.4776 

.4474 

.4116 

.3668 

.3057 


21 

.7322 

.6216 

.5612 

.5219 

.4938 

.4725 

.4420 

.4055 

.3599 

.2971 


22 

.7294 

.6182 

.5574 

.5178 

.4894 

.4679 

.4370 

.4001 

.3536 

.2892 


23 

.7269 

.6151 

.5540 

.5140 

.4854 

.4636 

.4325 

.3950 

.3478 

.2818 


24 

.7246 

.6123 

.5508 

.5106 

.4817 

.4598 

.4283 

.3904 

.3425 

.2749 


25 

.7225 

.6097 

.5478 

.5074 

.4783 

.4562 

.4244 

.3862 

.3376 

.2685 


26 

.7205 

.6073 

.5451 

.5045 

.4752 

.4529 

.4209 

.3823 

.3330 

.2625 


27 

.7187 

.6051 

.5427 

.5017 

.4723 

.4499 

.4176 

.3786 

.3287 

.2569 


28 

.7171 

.6030 

.5403 

.4992 

.4696 

.4471 

.4146 

.3752 

.3248 

.2516 


29 

.7155 

.6011 

.5382 

.4969 

.4671 

.4444 

.4117 

.3720 

.3211 

.2466 


30 

.7141 

.5994 

.5362 

.4947 

.4648 

.4420 

.4090 

.3691 

.3176 

.2419 


60 

.6933 

.5738 

.5073 

.4632 

.4311 

.4064 

.3702 

.3255 

.2654 

.1644 


00 

.6729 

.5486 

.4787 

4319 

3974 

3706 

3309 

2804 

.2085 

0 


1 From Table VI, R. A. Fisher, Statistical Methods for Research Workers, 
This table is printed here through the courtesy of Dr. Fisher and his pul> 
lishers, Oliver and Boyd, of Edinburgh. 
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Appendix Table VIII 


Squares of the Natural Numbers from 100 to 999 


-SQUARE OF- 


n 


n 


r 

i + 

1 

n 

t + 

2 

t 

i + 

3 

f 

I + 

4 

r 

1 + 

5 

n 

t + 

6 

n 

1 + 

7 

71 

1 + 

8 

71 

+ 

9 

100 

1 

00 

00 

1 

02 

01 

1 

04 

04 

1 

06 

09 

1 

08 

16 

1 

10 

25 

1 

12 

36 

1 

14 

49 

1 

16 

64 

1 

18 

81 

no 

1 

21 

00 

1 

23 

21 

1 

25 

44 

1 

27 

69 

1 

29 

96 

1 

32 

25 

1 

34 

56 

1 

36 

89 

1 

39 

24 

1 

41 

61 

120 

1 

44 

00 

1 

46 

41 

1 

48 

84 

1 

51 

29 

1 

53 

76 

1 

56 

25 

1 

58 

76 

1 

61 

29 

1 

63 

84 

1 

66 

41 

130 

1 

69 

oo; 

1 

71 

61 

1 

74 

24 

1 

76 

89 

1 

79 

56 

1 

82 

25 

1 

84 

96 

1 

87 

69 

1 

90 

44 

1 

93 

21 

140 

1 

96 

00 

1 

98 

81 

2 

01 

64 

2 

04 

49 

2 

07 

36 

2 

10 

25 

2 

13 

16 

2 

16 

09 

2 

19 

04 

2 

22 

01 

150 

2 

25 

00 

2 

28 

01 

2 

31 

04 

2 

34 

09 

2 

37 

16 

2 

40 

25 

2 

43 

36 

2 

46 

49 

2 

49 

64 

2 

52 

81 

160 

2 

56 

00 

2 

59 

21 

2 

62 

44 

2 

65 

69 

2 

68 

96 

2 

72 

25 

2 

75 

56 

2 

78 

89 

2 

82 

24 

2 

85 

61 

170 

2 

89 

00 

2 

92 

41 

2 

95 

84 

2 

99 

29 

3 

02 

76 

3 

06 

25 

3 

09 

76 

3 

13 

29 

3 

16 

84 

3 

20 

41 

180 

3 

24 

00 

3 

27 

61 

3 

31 

24 

3 

34 

89 

3 

38 

56 

3 

42 

25 

3 

45 

96 

3 

49 

69 

3 

53 

44 

3 

57 

21 

190 

3 

61 

00 

3 

64 

81 

3 

68 

64 

3 

72 

49 

3 

76 

36 

3 

80 

25 

3 

84 

16 

3 

88 

09 

3 

92 

04 

3 

96 

01 

200 

4 

00 

00 

4 

04 

01 

4 

08 

04 

4 

12 

09 

4 

16 

16 

4 

20 

25 

4 

24 

36 

4 

28 

49 

4 

32 

64 

4 

36 

81 

210 

4 

41 

00 

4 

45 

21 

4 

49 

44 

4 

53 

69 

4 

57 

96 

4 

62 

25 

4 

66 

56 

4 

70 

89 

4 

75 

24 

4 

79 

61 

220 

4 

84 

00 

4 

88 

41 

4 

92 

84 

4 

97 

29 

5 

01 

76 

5 

06 

25 

6 

10 

76 

6 

15 

29 

5 

19 

84 

5 

24 

41 

230 

5 

29 

00 

5 

33 

61 

5 

38 

24 

5 

42 

89 

5 

47 

56 

5 

52 

25 

5 

56 

96 

5 

61 

69 

5 

66 

44 

5 

71 

21 

240 

5 

76 

00 

5 

80 

81 

5 

85 

64 

5 

90 

49 

5 

95 

36 

6 

00 

25 

6 

05 

16 

6 

10 

09 

6 

15 

04 

6 

20 

01 

250 

6 

25 

00 

6 

30 

01 

6 

35 

04 

6 

40 

09 

6 

45 

16 

6 

50 

25 

6 

55 

36 

6 

60 

49 

6 

65 

64 

6 

70 

81 

260 

6 

76 

00 

6 

81 

21 

6 

86 

44 

6 

91 

69 

6 

96 

96 

7 

02 

25 

7 

07 

56 

7 

12 

89 

7 

18 

24 

7 

23 

61 

270 

7 

29 

00 

7 

34 

41 

7 

39 

84 

7 

45 

29 

7 

50 

76 

7 

56 

25 

7 

61 

76 

7 

67 

29 

7 

72 

84 

7 

78 

41 

280 

7 

84 

00 

7 

89 

61 

7 

95 

24 

8 

00 

89 

8 

06 

56 

8 

12 

25 

8 

17 

96 

8 

23 

69 

8 

29 

44 

8 

35 

21 

290 

8 

41 

00 

8 

46 

81 

8 

52 

64 

8 

58 

49 

8 

64 

36 

8 

70 

25 

8 

76 

16 

8 

82 

09 

8 

88 

04 

8 

94 

01 

300 

9 

00 

00 

9 

06 

01 

9 

12 

04 

9 

18 

09 

9 

24 

16 

9 

30 

25 

9 

36 

36 

9 

42 

49 

9 

48 

64 

9 

54 

81 

310 

9 

61 

00 

9 

67 

21 

9 

73 

44 

9 

79 

69 

9 

85 

96 

9 

92 

25 

9 

98 

56 

10 

04 

89 

10 

11 

24 

10 

17 

61 

320 

10 

24 

00 

10 

30 

41 

10 

36 

84 

10 

43 

29 

10 

49 

76 

10 

56 

25 

10 

62 

76 

10 

69 

29 

10 

75 

84 

10 

82 

41 

330 

10 

89 

00 

10 

95 

61 

11 

02 

24 

11 

08 

89 

11 

15 

56 

11 

22 

25 

11 

28 

96 

11 

35 

69 

11 

42 

44 

11 

49 

21 

340 

11 

56 

00 

11 

62 

81 

11 

69 

64 

11 

76 

49 

11 

83 

36 

11 

90 

25 

11 

97 

16 

12 

04 

09 

12 

11 

04 

12 

18 

01 

350 

12 

25 

00 

12 

32 

01 

12 

39 

04 

12 

46 

09 

12 

53 

16 

12 

60 

25 

12 

67 

36 

12 

74 

49 

12 

81 

64 

12 

88 

81 

360 

12 

96 

00 

13 

03 

21 

13 

10 

44 

13 

17 

69 

13 

24 

96 

13 

32 

25 

13 

39 

56 

13 

46 

89 

13 

54 

24 

13 

61 

61 

370 

13 

69 

00 

13 

76 

41 

13 

83 

84 

13 

91 

29 

13 

98 

76 

14 

06 

25 

14 

13 

78 

14 

21 

29 

14 

28 

84 

14 

36 

41 

380 

14 

44 

00 

14 

51 

61 

14 

59 

24 

14 

66 

89 

14 

74 

56 

14 

82 

25 

14 

89 

96 

14 

97 

69 

15 

05 

44 

15 

13 

21 

390 

15 

21 

00 

15 

28 

81 

15 

36 

64 

15 

44 

49 

15 

52 

36 

15 

60 

25 

15 

68 

16 

15 

76 

09 

15 

84 

04 

15 

92 

01 

400 

16 

00 

00 

16 

08 

01 

16 

16 

04 

16 

24 

09 

16 

32 

16 

16 

40 

25 

16 

48 

36 

16 

56 

49 

16 

64 

64 

16 

72 

81 

410 

16 

81 

00 

16 

89 

21 

16 

97 

44 

17 

05 

69 

17 

13 

96 

17 

22 

25 

17 

30 

56 

17 

38 

89 

17 

47 

24 

17 

55 

61 

420 

17 

64 

00 

17 

72 

41 

17 

80 

84 

17 

89 

29 

17 

97 

76 

18 

06 

25 

18 

14 

76 

18 

23 

29 

18 

31 

84 

18 

40 

41 

430 

18 

49 

00 

18 

57 

61 

18 

66 

24 

18 

74 

89 

18 

83 

56 

18 

92 

25 

19 

00 

96 

19 

09 

69 

19 

18 

44 

19 

27 

21 

440 

19 

36 

00 

19 

44 

81 

19 

53 

64 

19 

62 

49 

19 

71 

36 

19 

80 

25 

19 

89 

16 

19 

98 

09 

20 

07 

04 

20 

16 

01 

450 

20 

25 

00 

20 

34 

01 

20 

43 

04 

20 

52 

09 

20 

61 

16 

20 

70 

25 

20 

79 

36 

20 

88 

49 

20 

97 

64 

21 

06 

81 

460 

21 

16 

00 

21 

25 

21 

21 

34 

44 

21 

43 

69 

21 

52 

96 

21 

62 

25 

21 

71 

56 

21 

80 

89 

21 

90 

24 

21 

99 

61 

470 

22 

09 

00 

22 

18 

41 

22 

27 

84 

22 

37 

29 

22 

46 

76 

22 

56 

25 

22 

65 

76 

22 

76 

29 

22 

84 

84 

22 

94 

41 

480 

23 

04 

00 

23 

13 

61 

23 

23 

24 

23 

32 

89 

23 

42 

56 

23 

52 

25 

23 

61 

96 

23 

71 

69 

23 

81 

44 

23 

91 

21 

490 

24 

01 

00 

24 

10 

81 

24 

20 

64 

24 

30 

49 

24 

40 

36 

24 

50 

25 

24 

60 

16 

24 

70 

09 

24 

80 

04 

24 

90 

01 

500 

25 

00 

00 

25 

10 

01 

25 

20 

04 

25 

30 

09 

25 

40 

16 

25 

50 

25 

25 

60 

36 

25 

70 

49 

25 

80 

64 

25 

90 

81 

610 

26 

01 

00 

26 

11 

21 

26 

21 

44 

26 

31 

69 

26 

41 

96 

26 

52 

25 

26 

62 

66 

26 

72 

89 

26 

83 

24 

26 

93 

61< 

520 

27 

04 

00 

27 

14 

41 

27 

24 

84 

27 

35 

29 

27 

45 

76 

27 

56 

25 

27 

66 

76 

27 

77 

29 

27 

87 

84 

27 

98 

41 

530 

28 

09 

00 

28 

19 

61 

28 

30 

24 

28 

40 

89 

28 

51 

56 

28 

62 

25 

28 

72 

96 

28 

83 

69 

28 

94 

44 

29 

05 

21 

640 

29 

16 

00 

29 

26 

81 

29 

37 

64 

29 

48 

49 

29 

59 

36 

29 

70 

25! 

29 

81 

16 

29 

92 

09 

30 

03 

04 

30 

14 

oil 
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Appendix Table VIII — Continued 
Squares of the Natural Numbers from 100 to 999 


■SQUARE OF- 


n 


n 


n 

l + 

1 

n 

i + 

2 

7i 

l + 

3 

n 

l + 

4 

n 

l +5 

n + 

6 

r 

t + 

7 

71 

+ 

8 

71 

+ 

9 

650 

30 

25 

00 

30 

36 

01 

30 

47 

04 

30 

58 

09 

30 

69 

16 

30 

80 25 

30 91 

36 

31 

02 

49 

31 

13 

64 

31 

24 

81 

660 

31 

36 

00 

31 

47 

21 

31 

58 

44 

31 

69 

69 

31 

80 

96 

31 

92 25 

32 03 

56 

32 

14 

89 

32 

26 

24 

32 

37 

61 

670 

32 

49 

00 

32 

60 

41 

32 

71 

84 

32 

83 

29 

32 

94 

76 

33 

06 25 

33 17 

76 

33 

29 

29 

33 

40 

84 

33 

52 

41 

680 

33 

64 

00 

33 

75 

61 

33 

87 

24 

33 

98 

89 

34 

10 

56 

34 

22 25 

34 33 

96 

34 

45 

69 

34 

57 

44 

34 

69 

21 

690 

34 

81 

00 

34 

92 

81 

35 

04 

64 

35 

16 

49 

35 

28 

36 

35 

40 25 

35 52 

16 

35 

64 

09 

35 

76 

04 

35 

88 

01 

600 

36 

00 

00 

36 

12 

01 

36 

24 

04 

36 

36 

09 

36 

48 

16 

36 

60 25 

36 72 

36 

36 

84 

49 

36 

96 

64 

37 

08 

81 

610 

37 

21 

00 

37 

33 

21 

37 

45 

44 

37 

57 

69 

37 

69 

96 

37 

82 25 

37 94 

56 

38 

06 

89 

38 

19 

24 

38 

31 

61 

620 

38 

44 

00 

38 

56 

41 

38 

68 

84 

38 

81 

29 

38 

93 

76 

39 

06 25 

39 18 

76 

39 

31 

29 

39 

43 

84 

39 

56 

41 

630 

39 

69 

00 

39 

81 

61 

39 

94 

24 

40 

06 

89 

40 

19 

56 

40 

32 25 

40 44 

96 

40 

57 

69 

40 

70 

44 

40 

83 

21 

640 

40 

96 

00 

41 

08 

81 

41 

21 

64 

41 

34 

49 

41 

47 

36 

41 

60 25 

41 73 

16 

41 

86 

09 

41 

99 

04 

42 

12 

01 

660 

42 

25 

00 

42 

38 

01 

42 

51 

04 

42 

64 

09 

42 

77 

16 

42 

90 25 

43 03 

36 

43 

16 

49 

43 

29 

64 

43 

42 

81 

660 

43 

56 

00 

43 

69 

21 

43 

82 

44 

43 

95 

69 

44 

08 

96 

44 

22 25 

44 35 

56 

44 

48 

89 

44 

62 

24 

44 

75 

61 

670 

44 

89 

00 

45 

02 

41 

45 

15 

84 

45 

29 

29 

45 

42 

76 

45 

56 25 

45 69 

76 

45 

83 

29 

45 

96 

84 

46 

10 

41 

680 

46 

24 

00 

46 

37 

61 

46 

51 

24 

46 

64 

89 

46 

78 

56 

46 

92 25 

47 05 

96 

47 

19 

69 

47 

33 

44 

47 

47 

21 

690 

47 

61 

00 

47 

74 

81 

47 

88 

64 

48 

02 

49 

48 

16 

36 

48 

30 25 

48 44 

16 

48 

58 

09 

48 

72 

04 

48 

86 

01 

700 

49 

00 

00 

49 

14 

01 

49 

28 

04 

49 

42 

09 

49 

56 

16 

49 

70 25 

49 84 

36 

49 

98 

49 

50 

12 

64 

50 

26 

81 

710 

50 

41 

00 

50 

55 

21 

50 

69 

44 

50 

83 

69 

50 

97 

96 

51 

12 25 

51 26 

56 

51 

40 

89 

51 

55 

24 

51 

69 

61 

720 

51 

84 

00 

51 

98 

41 

52 

12 

84 

52 

27 

29 

52 

41 

76 

52 

56 25 

52 70 

76 

52 

85 

29 

52 

99 

84 

53 

14 

41 

730 

53 

29 

00 

53 

43 

61 

53 

58 

24 

53 

72 

89 

53 

87 

56 

54 

02 25 

54 16 

96 

54 

31 

69 

54 

46 

44 

54 

61 

21 

740 

54 

76 

00 

54 

90 

81 

55 

05 

64 

55 

20 

49 

55 

35 

36 

55 

50 25 

55 65 

16 

55 

80 

09 

55 

95 

04 

56 

10 

01 

760 

56 

25 

00 

56 

40 

01 

56 

55 

04 

56 

70 

09 

56 

85 

16 

57 

00 25 

57 15 

36 

57 

30 

49 

57 

45 

64 

57 

60 

81 

760 

57 

76 

00 

57 

91 

21 

58 

06 

44 

58 

21 

69 

58 

36 

96 

58 

52 25 

58 67 

56 

58 

82 

89 

58 

98 

24 

59 

13 

61 

770 

59 

29 

00 

59 

44 

41 

59 

59 

84 

59 

75 

29 

59 

90 

76 

60 

06 25 

60 21 

76 

60 

37 

29 

60 

52 

84 

60 

68 

41 

780 

60 

84 

00 

60 

99 

61 

61 

15 

24 

61 

30 

89 

61 

46 

56 

61 

62 25 

61 77 

96 

61 

93 

69 

02 

09 

44 

62 

25 

21 

790 

62 

41 

00 

62 

56 

81 

62 

72 

64 

62 

88 

49 

03 

04 

36 

63 

20 25 

63 36 

16 

63 

52 

09 

63 

68 

04 

63 

84 

01 

800 

64 

00 

00 

64 

16 

01 

64 

32 

04 

64 

48 

09 

64 

64 

16 

64 

80 25 

64 96 

36 

65 

12 

49 

65 

28 

64 

65 

44 

81 

810 

65 

61 

00 

65 

77 

21 

65 

93 

44 

66 

09 

69 

66 

25 

96 

66 

42 25 

66 58 

56 

66 

74 

89 

66 

91 

24 

67 

07 

61 

820 

67 

24 

00 

67 

40 

41 

67 

56 

84 

67 

73 

29 

67 

89 

76 

68 

06 25 

68 22 

76 

68 

39 

29 

68 

55 

84 

68 

72 

41 

830 

68 

89 

00 

69 

05 

61 

69 

22 

24 

69 

38 

89 

69 

55 

56 

69 

72 25 

69 88 

96 

70 

05 

69 

70 

22 

44 

70 

39 

21 

840 

70 

56 

00 

70 

72 

81 

70 

89 

64 

71 

06 

49 

71 

23 

36 

71 

40 25 

71 57 

16 

71 

74 

09 

71 

91 

04 

72 

08 

01 

860 

72 

25 

00 

72 

42 

01 

72 

69 

04 

72 

76 

09 

72 

93 

16 

73 

10 25 

73 27 

36 

73 

44 

49 

73 

61 

64 

73 

78 

81 

860 

73 

96 

00 

74 

13 

21 

74 

30 

44 

74 

47 

69 

74 

64 

96 

74 

82 25 

74 99 

56 

75 

16 

89 

75 

34 

24 

75 

51 

61 

870 

75 

69 

00 

75 

86 

41 

76 

03 

84 

76 

21 

29 

76 

38 

76 

76 

56 25 

76 73 

76 

76 

91 

29 

77 

08 

84 

77 

26 

41 

880 

77 

44 

00 

77 

61 

61 

77 

79 

24 

77 

96 

89 

78 

14 

56 

78 

32 25 

78 49 

96 

78 

67 

69 

78 

85 

44 

79 

03 

21 

890 

79 

21 

00 

79 

38 

81 

79 

56 

64 

79 

74 

49 

79 

92 

36 

80 

10 25 

80 28 

16 

80 

46 

09 

80 

64 

04 

80 

82 

01 

900 

81 

00 

00 

81 

18 

01 

81 

36 

04 

81 

54 

09 

81 

72 

16 

81 

90 25 

82 08 

36 

82 

26 

49 

82 

44 

64 

82 

62 

81 

910 

82 

81 

00 

82 

99 

21 

83 

17 

44 

83 

35 

69 

83 

53 

96 

83 

72 25 

83 90 

56 

84 

08 

89 

84 

27 

24 

84 

45 

61 

920 

84 

64 

00 

84 

82 

41 

85 

00 

84 

85 

19 

29 

85 

37 

76 

85 

56 25 

85 74 

76 

85 

93 

29 

86 

11 

84 

86 

30 

41 

930 

86 

49 

00 

86 

67 

61 

86 

86 

24 

87 

04 

89 

87 

23 

56 

87 

42 25 

87 60 

96 

87 

79 

69 

87 

98 

44 

88 

17 

21 

940 

88 

36 

00 

88 

54 

81 

88 

73 

64 

88 

92 

49 

89 

11 

36 

89 

30 25 

89 49 

16 

89 

68 

09 

89 

87 

04 

90 

06 

01 

960 

90 

25 

00 

90 

44 

01 

90 

63 

04 

90 

82 

09 

91 

01 

16 

91 

20 25 

91 39 

36 

91 

58 

49 

91 

77 

64 

91 

96 

81 

960 

92 

16 

00 

92 

35 

21 

92 

54 

44 

92 

73 

69 

92 

92 

96 

93 

12 25 

93 31 

56 

93 

50 

89 

93 

70 

24 

93 

89 

61 

970 

94 

09 

00 

94 

28 

41 

94 

47 

84 

94 

67 

29 

94 

86 

76 

95 

06 25 

95 25 

76 

95 

45 

29 

95 

64 

84 

95 

84 

41 

980 

96 

04 

00 

96 

23 

61 

96 

43 

24 

96 

62 

89 

96 

82 

56 

97 

02 25 

97 21 

96 

97 

41 

69 

97 

61 

44 

97 

81 

21 

990 

98 

01 

00 

98 

20 

81|98 

40 

64 

98 

60 

49 

98 

80 

36 

99 

00 25 

99 20 

16 

99 

40 

09 

99 

60 

04 

99 

80 

01 
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Appendix Table IX 


Sums of the First Three Powers of the Natural Numbers from 1 to 50 


n 

Z(») 

2(n») 

2(»») 

n 

2(n) 

2(n 5 ) 

2(n s ) 

1 

1 

1 

1 

26 

351 

6 201 

123 201 

2 

3 

5 

9 

27 

378 

6 930 

142 884 

3 

6 

14 

. 36 

28 

406 

7 714 

164 836 

4 

10 

30 

100 

29 

435 

8 555 

189 225 

5 

15 

55 

225 

30 

465 

9 455 

216 225 

6 

21 

91 

441 

31 

496 

10 416 

246 016 

7 

28 

140 

784 

32 

528 

11 440 

278 784 

8 

36 

204 

1 296 

33 

561 

12 529 

314 721 

9 

45 

285 

2 025 

34 

595 

13 685 

354 025 

10 

55 

385 

3 025 

35 

630 

14 910 

396 900 

11 

66 

506 

4 356 

36 

666 

16 206 

443 556 

12 

78 

650 

6 084 

37 

703 

17 575 

494 209 

13 

91 

819 

8 281 

38 

741 

19 019 

549 081 

14 

105 

1 015 

11 025 

39 

780 

20 540 

608 400 

15 

120 

1 240 

14 400 

40 

820 

22 140 

672 400 

16 

136 

1 496 

18 496 

41 

861 

23 821 

741 321 

17 

153 

1 785 

23 409 

42 

903 

25 585 

815 409 

18 

171 

2 109 

29 241 

43 

946 

27 434 

894 916 

19 

190 

2 470 

36 100 

44 

990 

29 370 

980 100 

20 

210 

2 870 

44 100 

45 

1 035 

31 395 

1 071 225 

21 

231 

3 311 

53 361 

46 

1 081 

33 511 

1 168 561 

22 

253 

3 795 

64 009 

47 

1 128 

35 720 

1 272 384 

23 

276 

4 324 

76 176 

48 

1 176 

38 024 

1 382 976 

24 

300 

4 900 

90 000 

49 

1 225 

40 425 

1 500 625 

25 

325 

5 525 

105 625 

50 

1 275 

42 925 

1 625 625 
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Five-Place Logarithms of Numbers 

100-160 



709 
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Five-Place Logarithms of Numbers 

160-200 
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328 

346 

364 

507 

525 

543 

686 

703 

721 

863 

881 

899 

♦041 

*058 

*076 

217 

235 

252 

393 

410 

428 

568 

585 

602 

742 

769 

777 


1 2.2 2.1 

2 4.4 4.2 

3 6.6 6.3 

4 8.8 8.4 

6 11.0 10.5 

6 13.2 12.6 

7 15.4 14.7 

8 17.6 16.8 

9 19.8 18.9 
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Five-Place Logarithms of Numbers 

260-500 



log e = .43429 
714 
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Five-Place Logarithms of Numbers 

300-360 
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714 
























Appendix Table X 


Five-Place Logarithms of Numbers 

400-460 



ru 
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716 
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Five-Place Logarithms of Numbers 


600-560 



717 
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Five-Place Logarithms of Numbers 


660-600 



718 
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Five-Place Logarithms of Numbers 

600-650 



719 
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Five-Place Logarithms of Numbers 

650-700 



720 
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Five-Place Logarithms of Numbers 

700-750 




132 I 138 I 144 I 150 156 163 169 175 181 


211 217 224 230 236 242 

272 278 285 291 297 

333 339 345 352 358 364 

394 400 406 412 418 425 

455 461 467 473 479 485 

516 522 528 534 540 546 

576 582 588 594 600 606 

637 643 649 655 661 667 

697 703 709 715 721 727 


745 751 


769 

775 

781 

788 




t 

1 

0.7 

2 

1.4 

3 

2.1 

4 

2.8 

5 

3.5 

6 

4.2 

7 

49 

8 

5.6 

9 

6.3 


6 

1 

0.6 

2 

1.2 

3 

1.8 

4 

2.4 

5 

3.0 

6 

3.6 

7 

4.2 

8 

4.8 

9 

! 5.4 


5 

1 

0.5 

2 

1.0 

3 

1.5 

4 

2.0 

6 

2.5 

6 

3.0 

7 

3.5 

8 

4.0 

9 

4.5 


731 






























Appendix Table X 



835 841 846 

892 898 904 

950 955 901 

*007 *013 *018 
064 070 076 




6 

1 

0.6 

2 

1.2 

3 

1.8 

4 

2.4 

5 

3.0 

6 

3.6 

7 

4.2 

8 

4.8 

9 

5.4 


6 

1 

0.5 

2 

1.0 

3 

1.5 

4 

2.0 

5 

2.5 

6 

3.0 

7 

3.5 

8 

4.0 

9 

1 

4.5 


722 
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Five-Place Logarithms of Numbers 

800-860 



723 
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Five-Place Logarithms of Numbers 

860-900 



1 

6 

0.8 

2 

1.2 

3 

1.8 

4 

2.4 

5 

3.0 

6 

3.6 

7 

4.2 

8 

4.8 

9 

6.4 

1 

5 

0.5 

2 

1.0 

3 

1.5 

4 

2.0 

5 

2.5 

6 

3.0 

7 

3.5 

8 

4.0 

9 

4.5 

1 

4 

0.4 

2 

0.8 

3 

1.2 

4 

1.6 

6 

2.0 

6 

2.4 

7 

2.8 

8 

32 

9 

3.6 


724 
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Five-Place Logarithms of Numbers 

900-950 




0 

1 

0.5 

2 

1.0 

3 

1.5 

4 

2.0 

5 

2.5 

6 

3.0 

7 

3.5 

8 

4.0 

0 

4.5 


4 

1 

0.4 

2 

08 

3 

1.2 

4 

1.6 

5 

2.0 

6 

2.4 

7 

2.8 

8 

3.2 

0 

3.6 
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Abscissa, 9, 64 
Accuracy of estimation, 332 
Actuarial science, 272, 671 
Aggregate, 307 ; use in index numbers, 
165, 181; weighted, 193, 316 
Alfalfa yield, correlation with irriga- 
tion, 404 ff. 

American index numbers of wholesale 
prices, Bradstreet’s index, 166, 182, 
209, 211; Harvard index, 210; U. S. 
Bureau of Labor Statistics, 168, 172, 
176, 193, 216 ff. 

American Telephone and Telegraph 
Co., index of industrial activity, 
312, 390, 393; study of frequency of 
telephone use, 440 
Amplitude of cycles, 238, 285 
Analysis of variance, see Variance 
analysis 

Anderson, Oskar, 454, 487 
Angell, James W., 270 
Anti-logarithm, 24 

Arbitrary origin, 351; in computing 
the mean, 106 

Areas of the normal curve, 436 ff., 699 
Arithmetic mean, 102; computation 
of, 103 ff.; weighted, 104; charac- 
teristics of, 126, 134; as most prob- 
able value, 332; moments about, 
442; of the binomial distribution, 
433, 660; significance of the dif- 
ference between means, 481 ff . ; sig- 
nificance of, in small samples, 
605 ff.; use in correlation analysis, 
537; standard error of, 464 ff., 664 
Arithmetic series, 16, 28, 275 
Array, 52, 82; in computing the cor- 
relation coefficient, 345 
Artillery observations, 90 
Astronomical observations, 89 
Asymmetry, see Skewness 
Average, 86 ff., 99, 101 ff.; relations 
between the several averages, 133; 
moving average, 234 ff . ; of ratios 
to trend, 287 


Average of relative prices, 183 ff., 
196 ff. 

Average relationship between vari- 
ables, 328 

Bank clearings, as index of business 
conditions, 242 

Bar diagram, see Column diagram 
Barlow's tables, 133 
Base period, 316 
Bean, Louis H., 564 
Beckett, S. H., 405 
Beta coefficients, 561 
Bias, of index numbers, 191, 195; of 
the correlation index, 412; in sam- 
pling, 461, 599, 611 
Binomial distribution, 429 ff., deriva- 
tion of the mean and standard 
deviation of, 660 
Birge, Raymond T., 469, 629 
Black, John D., 564 
Blakeman, John, 478 
Bowley, A. L., 159, 201; representa- 
tive sampling, 461; standard error, 
472 

Bradstreet’s index, 166, 182, 209; use 
in deflation, 383 
Burns, Arthur F., 242, 308 
Business, 1; classes of activity, 1; 
quantitative character of its prob- 
lems, 3, 5 

Business cycle, 293 ff.; as indicated by 
moving averages, 242; as measured 
by production changes, 305; pre- 
war relation to stock-price cycles, 
390 ff.; post-war relation, 396; dura- 
tion of, 467, 481; effect on price, 
475; see also Cyclical variation 

Census of manufactures, 309, 317, 371 
Center of gravity, 102 
Central tendency, 97, 99; measures 
of, 101 ff. 

Certainty, in probability theory, 426 
Chaddock, It. E., 84 
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Chain index, 216 

Chance, law of, see Normal law of 
error and Probability 
Characteristic, logarithmic, 24 
Charlier check, 149 
Charts, construction of, 32 ff.; for 
comparison of frequencies, 41 ff.; 
representation of component parts, 
42 ff.; cumulative, 44 ff.; Gannt 
progress chart, 46; see also Graphic 
presentation 

Chi-square, 618; distribution of, 618, 
626; in testing goodness of fit, 
626 ff.; in testing homogeneity, 
633 ff.; in testing independence of 
principles of classification, 630 ff. ; 
table of values, 625, 703 
Classification of quantitative mate- 
rial, see Organization of data 
Classification, principles of, 53 ff.; test- 
ing significance of, 494 ff.; testing 
independence of, 630 ff., 681 ff. 
Classified data, see Grouping of data 
Class interval, 53, 57 ff., 104, 347, 
357; in locating the mode, 117; in 
computing the standard deviation, 
149 

Coefficient of correlation, see Cor- 
relation coefficient 

Coefficient of multiple correlation, see 
Correlation, multiple 
Coefficient of regression, see Regres- 
sion 

Coefficient of variation, 156 
Coin tossing, 91 

Column diagram, 41, 64 ff., 66, 73, 91 
Commodities, included in price- 
change study, 209 

Compound interest, law of, 30 ff., 40; 
curve of, 267 

Concurrence of cycles, 390 ff. 
Constants, 12, 244 ff. 

Controls, in sampling procedure, 462 
Coordinate geometry, 8 ff. 

Correction, of index numbers, 311; 
of the correlation index, 412; of the 
standard error, 542 
Correction factor, in computing the 
correlation coefficient, 339; in com- 
puting the mean, 106, 351, 392; see 
also Bias 

Correlation, coefficient of, 334 ff., 520; 


calculation of, 337 ff., 353, 364, 648; 
product-moment method, 349 ff .> 
construction of table, 340 ff.; sum- 
mary of correlation procedure. 
366 ff.; limitations of, 370; relation 
to correlation ratio, 422; tests for 
the significance of, 502 ff., 611; sig- 
nificance of difference of coef- 
ficients, 616; standard error of, 474; 
derived from small samples, 610; 
weighted average of, 617, 618; table 
of significant values, 612, 701; 
table of relations to the z function, 
702 

Correlation, index of, 408 ff., 520; 
formula for, 408, 412, 647; sig- 
nificance of, 409; computation of, 
410, 412; standard error, 477 

Correlation, linear, 325 ff . ; lines of 
regression, 359 ff.; distortion in 
non-normal distributions, 372 ff.; 
of grouped data, 340 ff.; in the 
measurement of time sequence, 
389 ff . ; see also Correlation, coeffi- 
cient of 

Correlation, multiple, 530 ff.; prelim- 
inary analysis, 533; use of multi- 
variate estimating equation, 536 ff.; 
coefficient of, 543 ff.; correction 
for number of constants involved, 
544; test of significance of the 
coefficient of, 544; standard error of 
the coefficient of, 545; application 
of method, 547 ; limitations of proce- 
dure, 563; simplification of normal 
equations, 652 ff. 

Correlation, non-linear, 404 ff . ; use of 
reciprocal relations, 582; use of 
logarithmic relations, 575 ff. 

Correlation of time scries, 380 ff.; of 
secular trends, 381; of deviations 
from trend, 385; dangers of pro- 
cedure, 388, 389; concurrent cycles, 
391; use of moving average in, 
398; of short term fluctuations, 
398 ff. 

Correlation, partial, 584 ff.; relation 
to simple correlation, 549; system- 
atic computation of coefficients 
of, 554 ff.; standard error of the 
coefficient of, 560 

Correlation, rank, 374 ff., 521 
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Correlation ratio, 413 ff., 519; for- 
' mula for, 414; computation of, 415, 
417 ff.; significance of, 421; cor- 
rection of, 421; relation to correla- 
tion coefficient, 422 
Cost of living index, 221 
Cotton statistics, 161; correlation of 
price and production, 384, 400 
Coverage, of production index num- 
bers, 316 
Cox, G. M., 688 
Criteria of curve type, 444 
Crum, W. L., 302 

Cumulative charts, 44 ff.; arrange- 
ment of data, 77; frequency curve, 
80 

Curve fitting, by least squares, 246 ff . ; 
linear, 246; parabolic, 253, 260 ff.; 
of linear business series, 257 ; use of 
logs in, 264, 269 

Curve type, criteria of, 444; see also 
Functional relationship 
Cutts, Jesse M., 218 
Cycles, correlation of, 382, 389 
Cycles of reference, see Reference 
cycles 

Cyclical fluctuations, correlation of, 
382 ff . ; see also Cyclical variation 
Cyclical variation, 230, 302, 380, 526; 
removal by moving averages, 236, 
284; measurement of, 293 ff.; in 
industrial activity, 312 ff., 390 

Davenport, Donald H., 219, 254, 437, 
573 

Davenport, E., 415 
Day, Edmund E., construction of in- 
dex of physical volume, 310 
Decile, graphic location of, 114 
Deflation, in time series analysis, 279 
Degrees of freedom, in variance 
analysis, 491, 496, 504 ff., 512, 517, 
528; in statistical induction, 604, 
611 

Degree of relationship, see Relation- 
ship, measurement of 
Deming, W. E., 469, 470; Chi-square 
test, 629 

Dennis, Samuel J., 218 
Dependent variable, see Variable 
Depreciation, 79 
Derivative, partial. 639, 643 


Descartes, Ren4, 8 

Description, of frequency distribu- 
tions, 86, 137 ff., 448 ff.; methods 
of, 99 ff.; statistical, 452 ff. 

Deviate, normal, 437 
Deviation, 97; probable, 152; from 
trend, 263, 385, 395; from mean, 
347; vertical and horizontal, 363; 
from moving averages, 398; from 
means of arrays, 417; quartile, see 
Quartile deviation; mean, see Mean 
deviation; standard, see Standard 
deviation; root-mean-square, see 
Root-mean-square deviation 
Differences, finite, 275 
Discount rates, relation between, 
340 ff., 361 ff. 

Dispersion, 99, 115, 137, 414; zone of, 
89 ff., 349; measures of, 137 ff., 
330, 335; in correlation analysis, 
490; test of differences in disper- 
sion, 492; see also Variation and 
Scatter 

Distribution, frequency, 50 ff.; de- 
scription of, 137 ff.; general char- 
acteristics of, 97 ff.; of income, 71; 
of sawmills, 84; of heights, 87; of 
astronomical errors, 89; of artillery 
shots, 90; of coin throws, 91; of 
economic data, 93 ff.; of exchange 
rates, 95; of wage earners, 96; of 
bonds, 116; of stock prices, 125; 
see also List of charts 
Doolittle solution, of normal equa- 
tions, 540, 655 

Dow-Jones index number, 393 

Edgeworth, F. Y., 204; binomial ex- 
pansion, 432 

Elderton, W. P., Chi-square table, 624 
Equation of regression, see Regression 
Error, normal curve of, see Normal 
law of error 

Error, sampling, see Standard error 
Estimate, making of, 332 ff., 566 ff.; 

zone of, 349, 571 ff., 590 ff. 
Exchange rates, distribution of, 94 
Expected value, 294 
Exponential curve, 19, 28, 258, 266, 
271; modified, 272, 667; see also 
Logarithmic curve 
Exponent, logarithmic, 23 
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Exports, statistics of, 36, 38 
Extrapolation, 264, 277, 675 
Ezekiel, Mordecai, 412, 477, 547, 564, 
652; multiple correlation analysis, 
537 ff.; correction of standard error, 
542; correction of the correlation 
coefficient, 544 

Factor reversal test, 199 
Falkner, Helen D., 288 
Farm price index number, 222 ff. 
Federal Reserve Board, index of pro- 
duction, 316 ff. 

Fisher, Arne, 448 

Fisher, Irving, 181; time reversal test, 
190; weighted index numbers, 195, 
196, 204; factor reversal test, 199; 
ideal index number, 201 
Fisher, R. A., 270, 479, 603; statistical 
population, 456; null hypothesis, 
475; analysis of variance, 490 ff.; 
z table, 499, 704-5; extension of z 
table, 518; t table, 603, 700; sig- 
nificance of the correlation coeffi- 
cient, 611; Chi-square table, 625, 
628, 703 

Frazier, Edward K., 105 
Frequency curve, 41, 82; polygon, 
41, 67, 85, 88, 91, 93 
Frequency distribution, 50 ff . ; pur- 
pose of, 56; comparison of, 86; gen- 
eral characteristics of, 97 ff.; see 
also Distribution 

Frequency, theoretical and actual, 
431 ff. 

Friedman, Milton, 521 
Functional relationship, 12, 389; lin- 
ear, see Linear relationship; para- 
bolic, see Parabolic relationship 

Galton, Francis, lines of regression, 
359 

Gannt, H. L., progress chart, 46 
Gauss, Karl Friedrich, normal law of 
error, 435 

Geometric mean, 125 ff . ; definition of, 
125; computation of, 126; charac- 
teristics of, 127, 135; as measure of 
central tendency, 129; as average 
of relative prices, 185; of logarith- 
mic observations, 584 
Geometric progression, 18, 28, 271, 
275, 669 


Glover, James W., 269, 437 
Gompertz curve, 272, 671 
Goodness of fit, 447; criteria of, 276; 

Chi-square test of, 626 ff . 

Graphic method, of locating aver- 
ages, 120 ff.; in multiple correla- 
tion, 564 

Graphic presentation, 8ff.; of fre- 
quency distributions, 63; of time 
series, 227 

Grouping of data, 53, 112; ungrouped 
data, 109; effect on mode, 115; in 
correlation tables, 340, 354 
Growth curves, Gompertz, 272, 671; 
modified exponential, 272, 667 ff.; 
logistic, 272, 675 ff. 

Hall, Lincoln W., 288 
Harmonic equation, 579 
Harmonic mean, 132 ff.; character- 
istics of, 135; of relative prices, 
186; of reciprocal observations, 585, 
587 

Hart, Homell, reliability of a per- 
centage, 483 

Height distribution, 87, 360 
High contact, of frequency distribu- 
tions, 443 

Histogram, 64 ff.; see also Column 
diagram 

Homogeneity, 487; tests for, 120, 
630 ff.; in time series, 301; in sam- 
pling procedure, 462, 607; Chi- 
square test of, 633 ff . 

Hotelling, Harold, 378, 479 
Hyperbolic curve, 16, 28, 569 

Ideal index, 201; for the measurement 
of production, 307 

Income distribution, statistics of, 71, 
97, 102 

Independence, tests of, 630 ff . 
Independent variable, see Variable 
Index numbers, 18; nature of, 161 ff.; 
“ ideal/ ” 201; use of aggregates, 
165; of retail price, 220; of cost of 
living, 221; of farm price, 222 ff.; of 
seasonal variation, 287 ff . ; of in- 
dustrial activity, 312 ff., 390, 393; 
of stock prices, 390 ff . 

Index numbers of production, 305 ff.; 
unadjusted, 306; adjusted, 310; 
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Federal Reserve Board index, 316; 
•derived from price indices, 319 ff.; 
of industrial productivity, 321 
Index numbers of wholesale prices, 
167, 216 ff.; purpose of, 170; con- 
struction of, 180 ff., 208 ff.; aggre- 
gative type, 181; arithmetic aver- 
age type, 183; weighted, 196; of 
farm crop prices, 182, 189; geomet- 
ric average type, 185, 198; median 
type, 185; harmonic type, 186; 
comparison of types, 188; time 
reversal test, 190; weighted types, 
193 ff., 198; alternative types, 
204 ff . ; commodities to be included 
in, 209 

Index of correlation, see Correlation 
index 

Index of variability, 157 
Induction, statistical, 452 ff., 598 ff.; 
nature of, 453; measures of reliabil- 
ity, 464 ff.; generalizing from small 
samples, 598 ff. 

Industrial change, measurements of, 
322 

Inference, statistical, see Induction 
Interaction, of principles of classifica- 
tion, 688 

Interpolation, 70, 81, 277; for the 
median, 114; for the mode, 118; for 
monthly trend values, 273; in 
Fisher's z table, 500; double inter- 
polation, 507 

Irrigation, correlated with alfalfa 
yield, 404 ff . 

Jones, D. C., binomial distribution, 
660 

Karsten, Karl G., 278 
Kelley, Truman L., 206; reliability of 
constants, 485 
Kendall, M. G., 629 
Keynes, J. M., random sampling, 461 
Killough, H. B., 569 
Knibbs, Sir George, 214 
Kurtosis, 100, 137, 159 
Kurtz, Edwin, 77 

Lag, in time series analysis, 390 ff . ; 
changes in different cvcle phases, 
397 


Laspeyre's index number, 193, 214 
Law of large numbers, 455 
Least squares, method of, 246 ff., 
638 ff.; applied to linear relations, 
246, 328, 354, 366, 509; applied to 
power curves, 260, 405; applied to 
logarithmic curves, 264 ff.; in cor- 
relation analysis, 366, 373, 405 
Leptokurtic, 449 
Life table, 77 

Line of regression, see Regression 
Linear correlation, see Correlation, 
linear 

Linearity, test for, 423; by variance 
analysis, 508 ff.; see also Linear 
relationship 

Linear relationship, 14, 16, 26, 325 ff.; 
fitting by least-squares, 246 ff.; in 
business series, 257, 268; between 
discount rates, 348; tests for, 423, 
477, 508 

Link relatives, 204 

Logarithmic, equation, 26 ff., 563, 
569 ff., 671; mean, 128; see also 
Geometric mean; paper, 131, 227; 
deviation, 265; function of the cor- 
relation coefficient, 614 
Logarithms, common, 23 ff., 492, 572; 
use in computing the geometric 
mean, 125, 130; use in curve fitting, 
264 ff., 269; Naperian, 435, 492; 
Appendix table X, 709 
Logistic curve, 272, 675 

Macaulay, F. R., 185, 244 
Malenbaum, Wilfred, 565 
Mantissa, 24 

Manufactured goods, r61e in price 
movements, 213 

Mean, arithmetic, see Arithmetic 
mean; geometric, see Geometric 
mean; harmonic, see Harmonic 
mean 

Mean deviation, 139 ff. 

Mean product, 351, 358 
Measurement of, central tendency, 
see Central tendency; relationship, 
see Relationship, etc. 

Median, definition of, 102; location of, 
109 ff.; computation of, 113; 
graphic location of, 120 ff.; char- 
acteristics of, 134; relation to mean 
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deviation, 140; of relative prices, 
185; standard error of, 472 
Merriman, Mansfield, 90 
Mesokurtic, 449 
Minor, J. R., 556 

Mitchell, W. C., 93, 173, 176, 212, 
242, 303; comparison of index num- 
bers, 209; business cycles, 467, 483 
Mode, 96; definition of, 101; location 
of, 115 ff.; graphic location of, 
120 ff.; characteristics of, 135 
Moments, of frequency distributions, 
440; about the mean, 442 
Monthly trend values, 272 ff. 
Mortality tables, 80 
Moving average, 234 ff.; application 
to non-linear series, 239; measure- 
ment of seasonal fluctuations, 
285 ff.; use in correlating cycles, 
398 ff. 

Mudgett, Bruce D., 216 
Multiple correlation, see Correlation, 
multiple 

Multiple frequency table, 289 

Napierian logarithm, 435 
National Bureau of Economic Re- 
search, 244, 320, 397; study of in- 
come distribution, 132; construc- 
tion of index numbers, 219; study 
of production change, 309 
National Industrial Conference Board, 
cost of living index, 221 
Natural number, 24, 28; table of 
squares of, 706; sums of powers, 708 
New York Census of Manufactures, 
309, 317, 371 

Non-linear correlation, see Correla- 
tion, non-linear 

Non-linear relationship, 404 ff.; see 
also Parabolic and exponential func- 
tion 

Normal deviate, 437, 599; table of, 
603, 699 

Normal equations, for linear relation- 
ships, 249; parabolic, 254; of multi- 
variate relationships, 537 ff.; deriva- 
tion of, 639; formation of, 640; 
checks on, 648; Doolittle solution 
of, 654 

Normal law of error, 98, 153, 425 ff., 
435 ff.; assumptions underlying, 


436; its use, 438; economic appli- 
cation of, 440 ff.; criteria for, 441; 
fitting the normal curve, 445 ff.; 
distribution, 332, 371, 458; de- 
parture from, 374, 378; computa- 
tion of theoretical frequencies, 446; 
generalization of results, 448; of 
the distribution of means, 464; use 
in measures of reliability, 464 ff.; 
area under, 437, 699; test of good- 
ness of fit of, 627 
Null hypothesis, 475 

Ogive, 80 ff., 85 

Organization of data, 51, 82, 100; in 
time series, 226 

Origin, arbitrary, 107, 351; at point 
of averages, 353, 365 
Orthogonal polynomials, 270 

Paasche’s index number, formula for, 
195, 215 

Pabst, Margaret, 378, 479 
Parabolic curve, 16, 21, 27, 270, 577; 

see also Parabolic function 
Parabolic function, fitting of, 253 ff . ; 
second degree, 260, 405; logarith- 
mic, 264, 269, 270; testing para- 
bolic hypothesis, 514 ff. 

Parameter, 457 

Pareto, Vilfredo, law of income dis- 
tribution, 132 

Partial correlation, see Correlation, 
partial 

Peake, E. G., 94 

Peakedness, 100; see also Kurtosis 
Pearl, Raymond, 271, 272; formation 
of normal equations, 642; logistic 
curve, 675 

Pearson, Karl, 156, 158, 254, 436; 
coefficient of correlation, 335; cor- 
relation ratio, 413 ff.; curve types, 
448; descriptive measures of fre- 
quency distributions, 448 ff . ; statis- 
tical inference, 454; Chi-square dis- 
tribution, 618 ff., 626 
Percentages, difference between and 
significance of, 483 
Percentile, 114 

Periodic fluctuation, 230; removal by 
moving averages, 235; see also 
Seasonal and cyclical variation 
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Periodic function, 21 
Persons, Warren M., 204; analysis of 
cycle lags, 390 ff . 

Platykurtic, 449 

Polynomial, orthogonal, 270; see also 
Parabolic function 
Population, statistical, 453, 454, 456 
Potential series, 21 

Power series, 253; see also Parabolic 
function 

Price relative, 162; arithmetic aver- 
age of, 183 

Price, wholesale, 93, 168; index num- 
bers of, 161 ff., 167, 216 ff.; see also 
Index number; price ratios, 171 ff.; 
measurement of change of, 174; 
wholesale groups, 211; index of re- 
tail, 220; of farm products, 222; 
deflation of, 279; measurement of 
variation in, 493 

Probable error, 152, 155; of index 
numbers, 206; see also Standard 
error 

Probability, 603; principles of, 425 ff.; 
addition of, 427; measurement of, 
429; a priori, 431; empirical, 431; 
normal, 439, 459, 471; normal 
table of, 699; integral, 436 
Probability, curve 98; see also Normal 
law of error 
Probable value, 332 
Production, statistics of, 10, 35, 40, 
43, 47; of fuel, 163, 265; of crops, 
192; as measured by index numbers, 
305 ff . ; see also List of charts 
Product-moment method, 349 ff., 
368; for classified data, 354 ff. 
Projection, of trend values, 277, 402 
Purposive selection, in sampling pro- 
cedure, 462 

Quartile, 114; graphic location of, 
120 ff.; deviation, 150 ff., 154; 
standard error of, 473 

Random fluctuations, 231; removal 
by moving averages, 241 
Random sampling, 458, 461; see also 
Sampling 

Range, of variation, 139, 154; semi- 
interquartile, 151 


Rank correlation, 374 ff.; see also 
Correlation, rank 

Rate, of interest, 30, 76, 228; of 
change, 40, 267, 278, 587; of ex- 
change, 94; averaging of, 125 
Ratio, chart, 29, 35 
Ratio, correlation, 413; see also Cor- 
relation ratio 

Reciprocals, use in measuring rela- 
tionship, 578 ff., 675 ff. 

Reed, Lowell, J., 272; logistic curve, 
675 

Reference cycles, 243, 262; correla- 
tion of, 382 
Regimen, 214, 322 

Regression, lines of, 359 ff.; use of, 
364 ff., 367, 423, 607; coefficient of 
regression, 359 ff., 363, 479, 561, 
607; for cotton production and 
price, 387, 401; standard error of 
coefficient of regression, 479, 607, 
609 

Relationship, between income and 
auto registration, 326 ff., 352; meas- 
urement of, 325 ff., 334; between 
discount rates, 340 ff.; between 
time series, 380 ff . ; temporal, 391 ff . ; 
linear, see Linear relationship 
Relative deviations, 129; weighted, 
167 

Relative price, 162; arithmetic aver- 
age of, 183; geometric average, 
185, 198; harmonic average, 187; 
weighted average, 196 
Relative variation, measurement of, 
156 ff., 264 

Reliability, measures of, 464; of the 
mean, 464; of the difference be- 
tween means, 481, 483; of the me- 
dian, 472; of the standard deviation, 
473; of the coefficient of correlation, 
474; index of correlation, 477; coeffi- 
cient of regression, 478 
Residuals, 247 

Residual variability, see Variability, 
residual 

Retail price, index of, 220 

Richardson, A. H., 49 

Rietz, H. L., 143 

Robertson, R. D., 405 

Robinson, G., 88, 274, 465 

Root-mean-square deviation, 146. 
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276, 330, 416; see also Standard 
deviation 
Rulon, P. J., 485 

Sample, size of, 117; estimates from, 
146; in constructing index numbers, 
206 

Sampling, problem of, 452 ff., 460; 
random, 458, 461; generalizing from 
small samples, 598 ff.; errors of, 
293, 447; see also Standard error 
Sasuly, Max, 270 
Scale, for curve reading, 39 
Scatter, 99; degree of, 137, 334, 409, 
414, 646; see also Variation 
Scatter diagram, 326, 328, 348, 370, 
416 

Scott, Frances V., 219 
Seasonal variation, 230, 284 ff., 380; 
removal by moving averages, 235; 
measurement of, 287 ff. ; adjustment 
of, 317; test of significance of, 522 ff. 
Secular trend, 229, 380, 487 ; of cotton 
production and price, 383, 385; 
measurement of, 231 ff.; represen- 
tation by moving average, 234; 
by mathematical curves, 244 ff., 
667 ff.; of business series, 257 ff.; 
selection of curve, 274 ff . 

Selection of curve of trend, 274 
Semi-interquartile range, 151 
Semi-logarithmic charts, 28, 264; 
advantages of, 40 

Series, periodic, 21; potential, 21; 
continuous, 75 

Sheppard, W. F., correction for 
grouping, 150, 442 ff.; table of 
normal areas, 436 

Shewhart, W. A., 49; distribution of 
the standard deviation, 600 ff . 
Significance, tests of, 464 ff.; see also 
Standard error 
Significant figures, 485 
Sine curve, 21 

Skewness, 96; measures of, 100, 122, 
137 ff., 157 ff., 449; of geometric 
series, 129; of the standard devia- 
tion, 600; of the correlation co- 
efficient, 610 

Slope, 293; of regression line, 336, 
350, 359, 361; see also Regression 
coefficient 


Smoothing of curves, 69 ff., 76, 117 
Snedecor, George W., 449, 688 
Snyder, Carl, 229 
Spurr, W. A., 293 

Squares of natural numbers, table of, 
706 

Standard deviation, 145 ff., 330, 
371, 416; characteristic features of, 
155; use in adjusting index num- 
bers, 311, 393, 395; in terms of 
moments, 443; about the means 
of arrays, 418; use of, in variance 
analysis, 491, 494; see also Standard 
error 

Standard error, of the binomial dis- 
tribution, 434, 660; of the mean, 
464, 664; of the difference of 
means, 481, 483; of the median, 
472; of the standard deviation, 473; 
of the correlation coefficient, 474, 
545; of the correlation index, 477; 
of the regression coefficient, 478; of 
the partial correlation coefficient, 
560, 615; of the z function, 493, , 
615; limitations of above measures, 
486 ff. 

Standard error of estimate, 330 ff.; 
computation of, 333, 338, 370, 388, 
401, 406, 590; short-cut calculation, 
346, 354; of parabolic functions, 
410; significance of, 349, 371; about 
line of regression, 480; correction 
of, 413, 542; in multiple correla- 
tion analysis, 534, 541 ff.; of loga- 
rithmic functions, 571 ff.; in ratio 
terms, 573; in reciprocal terms, 
581 ; zones of estimate, 590 ff . 

Starr, G. W., 312 
Statistic, 457 

Statistical description, see Descrip- 
tion 

Statistical induction; see Induction 
Steinmetz, C. P., 256 
Stewart, Ethelbert, 84 
Stock price cycles, relation to busi- 
ness activity, 390, 397 
Straight line, fitting of, 246; see also 
Linear relationship 
Stratification, in sampling procedure, 
462 

Stratified purposive sampling, 463; 
standard error of, 472 
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“ Student,” standard error of the rank 
correlation coefficient, 479; stand- 
ard error of the mean, 599; distribu- 
tion of the standard deviation, 
600 ff. 

Sturges, H. A., bl 
Symbols, glossary of, 691 ff. 
Symmetry, 100; degree of, 120; see 
also Skewness 

Table, of areas under the normal 
curve, 699; Fisher t table, 603, 
700; of significant values of the cor- 
relation coefficient, 612, 701; of 
relations of the correlation coeffi- 
cient to the z function, 702; of the 
distribution of z, 704—5; of the 
powers of natural numbers, 706, 
708; of common logs, 709 
Tabulation of data, 51, 62; in cor- 
relation tables, 341, 354, 415 
Tendency, central; see Averages and 
Central tendency 
Thompson, F. L., 292 
Time reversal test, 190 
Time series, charts, 33, 48, 50; 
analysis of, 225 ff., 295; graphic 
representation, 227; removal of 
cycles, 234; fitting a line to, 252; 
measurement of seasonal fluctua- 
tion, 284 ff.; of cyclical fluctuation, 
284; measurement of relations be- 
tween, 380 ff.; see also Correlation 
of time series 
Tolley, H. R., 537, 652 
Trend, 262; of price movements, 170; 
of monthly values, 272; selection of 
curve of, 274 ff.; measurement of, 
225 ff.; secular, see Secular trend 

Ungrouped data, 109; product mo- 
ment method for, 352 
Uniformity of nature, principle of, 
457 

Unweighted index number, 184 
U. S. Bureau of Internal Revenue, 
326 

U. S. Bureau of Labor Statistics, 
statistics of fuel production, 164; 
index of wholesale prices, 168, 172, 
176, 212, 216 ff., 282; index number 


used, 193; index of retail prices, 
220; cost of living index, 221 
U. S. Department of Agriculture, in- 
dex of farm prices, 222 

Variability, measures of, 490 ff., 560; 
between classes, 494 ff.; absolute, 
586; residual, 526, 689; see also 
Variance and variation 
Variable, 11; relations between vari- 
ables, 325 ff., 359, 360 
Variance, analysis of, 490 ff.; z test 
of difference in variability, 492, 
506, 513, 517; in testing variability 
between classes, 494; in the meas- 
urement . of relationship, 501 ff., 
519; in testing linearity, 508 ff.; 
curvilinear hypothesis, 514 ff.; test- 
ing seasonal fluctuation, 522 ff . ; 
in testing the multiple correlation 
coefficient, 545; in testing signifi- 
cance of principles of classification, 
681 ff. 

Variation, 97; measures of, 99, 137 ff., 
330; absolute, 138; comparison of 
measures of, 153, 155; measures of 
difference in, 490 ff.; coefficient of, 
156; in price relatives, 171 ff.; 
within and between arrays, 502; 
see also Seasonal and cyclical fluc- 
tuation 

Verhulst, P. F., 272 

Wage statistics of, 96, 103, 105, 111, 
124 

Wahr, George, 437 

Walsh, C. M., 130, 201; ratio variabil- 
ity, 596 

Weighted average, 104, 106; of rela- 
tive prices, 106; geometric, 125; 
moving average, 244 
Weldon, W. F. R., dice experiment, 
432, 618 

Wheat, exports of, 33; yield corre- 
lated with fertilizer, 415 
Whipple, G. C., 87 
Whittaker, E. T., 88, 274, 465 
Wholesale price, 211 ff.; index of, 
216 ff.; see also Price 
Working, Holbrook, harmonic mean. 
588 
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Yates, F., 688 z test of variability, 492, 506, 514, 

Yule, G. U., 60, 418; Chi-square fre- 517; tables of, 704-5; standard error 
quencies, 622, 629 of, 615 

z transformation of correlation coeffi- 
Zone, of estimate; see Estimate, Dis- dent, 613 ff., 702 
persion 







